Background of the Invention

The invention pertains to wireless communications and, more particularly, to methods 
and apparatus for interference cancellation in code-division multiple access communications. 
The invention has application, by way of non-limiting example, in improving the capacity of 
cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 
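
As a minimal sketch of code-based multiplexing (an illustration only, not the scheme of any particular standard), the fragment below spreads one bit from each of two users with orthogonal length-4 codes, sums the chips as a shared noiseless channel would, and recovers each bit by correlating with that user's code. The code values and the function names are assumptions chosen for the example.

```python
# Illustrative only: two users share one channel using orthogonal spreading codes.
CODES = {
    "user_a": [1, 1, 1, 1],
    "user_b": [1, -1, 1, -1],
}

def spread(bit, code):
    """Multiply a +1/-1 data bit by every chip of the user's code."""
    return [bit * chip for chip in code]

def despread(received, code):
    """Correlate the received chips with one code and decide the bit by sign."""
    correlation = sum(r * chip for r, chip in zip(received, code))
    return 1 if correlation >= 0 else -1

def transmit(bits):
    """Sum the spread signals of all users chip by chip (ideal, noiseless channel)."""
    signals = [spread(bits[user], CODES[user]) for user in bits]
    return [sum(chips) for chips in zip(*signals)]
```

Because the two codes are orthogonal, each correlation cancels the other user's contribution exactly. With codes that are not perfectly orthogonal at the receiver, the residual cross-correlation is what shows up as interference.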

A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g., 
multiple cellular phone users in the same geographic area using their phones at the same time. 
This is referred to as multiple access interference (MAI). It has the effect of limiting the
capacity of cellular phone base stations, since interference may exceed acceptable levels - driving
service quality below acceptable levels - when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al, "Multiuser Detection for 
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 
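
The principle of using all users' signals jointly can be sketched with a single stage of parallel interference cancellation, one of the MUD families covered in the surveys cited above. This is a simplified illustration under stated assumptions, not the method of the invention: users are synchronous, the channel is noiseless, and the cross-correlation matrix `R` and the amplitudes are known exactly. All names are hypothetical.

```python
# Sketch of one parallel interference cancellation (PIC) stage for synchronous
# users. Assumes the cross-correlation matrix R and the amplitudes are known.

def sign(x):
    return 1 if x >= 0 else -1

def matched_filter_outputs(R, amplitudes, bits):
    """Noiseless matched-filter outputs y = R * A * b for the example."""
    n = len(bits)
    return [sum(R[k][j] * amplitudes[j] * bits[j] for j in range(n)) for k in range(n)]

def matched_filter_decisions(y):
    """Conventional single-user detection: decide each bit from its own output."""
    return [sign(v) for v in y]

def pic_stage(y, R, amplitudes, decisions):
    """Subtract every other user's estimated contribution, then re-decide."""
    refined = []
    for k in range(len(y)):
        interference = sum(
            R[k][j] * amplitudes[j] * decisions[j]
            for j in range(len(y)) if j != k
        )
        refined.append(sign(y[k] - interference))
    return refined
```

With a weak user (amplitude 1) and a strong user (amplitude 3) whose codes have cross-correlation 0.6, the conventional decision for the weak user is wrong when the two bits differ, which is the near/far problem in miniature; one cancellation stage corrects it. The nested sums over all user pairs also hint at why the computational cost grows quickly with the number of users.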

An object of this invention is to provide improved methods and apparatus for wireless
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 

A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 

Summary of the Invention

These and other objects are met by the invention which provides, in one aspect, a
wireless communications system, referred to as the "MCW-1" (among other terms) in the
materials that follow, and methods of operation thereof. An overview of that system is provided
in the document entitled "Software Architecture of the MCW-1 MUD Board," immediately
following this Summary. A more complete understanding of its implementation may be attained
by reference to the other attached materials.

In view of those materials, aspects of the invention include, but are not limited to, the 
following: 

• hardware and/or software architectures (and methods of operation thereof) for
multi-user detection in wireless communications systems and particularly, for
example, in a wireless communications base station;

• a hardware architecture (and methods of operation thereof) for multi-user
detection in wireless communications systems pairing each processing node with
NVRAM and watchdog PLD for fault management;

• methods and apparatus for connecting watchdog PLDs with an out-of-band
fault-management bus;

• methods and apparatus for use of an embedded host with the RACEway™
architecture of Mercury Computer Systems, Inc.;

• methods and apparatus for interfacing a digital signal processor to the
RACEway™ architecture;

• methods and apparatus for interfacing the RACEway™ architecture to a
programming port in a device for multi-user detection in wireless communications
systems;

• methods and apparatus for implementing a DMA Engine FPGA for use in
multi-user detection in a wireless communications system;

• methods and apparatus for implementing a hardware-based reset voter and stop
voter;

• methods and apparatus for scalable mapping of handset and BTS functions to
multiple processors;

• methods and apparatus for facilitating allocation and management of buffers for
interconnecting processors that implement the aforementioned mapping;

• methods and apparatus for implementing a hybrid operating system, e.g., with the
VxWorks operating system (of WindRiver Systems, Inc.) on a host computer and
the MC/OS operating system on RACE®-based nodes (RACE and MC/OS are
trademarks of Mercury Computer Systems, Inc.);

• methods and apparatus for high-availability multi-user detection in wireless
communications systems, including (by way of non-limiting example) round-robin
fault testing, use of NVRAM to store fault symptoms, and use of a master
to diagnose faults from NVRAM contents;

• class library-based methods and apparatus for facilitating interprocessor
communications, by way of non-limiting example, in buffering for multi-user
detection in wireless communications systems;

• methods and apparatus for implementation of R-matrix, gamma-matrix and MPIC
computations on separate processors in a device for multi-user detection in 
wireless communications systems; 

• methods and apparatus for computing complementary R-matrix elements in
parallel using multiple processors in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for depositing results of R-matrix calculations 
contiguously in memory in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for increasing the number of MPIC and R-matrix 
calculations performed in cache in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for performing a gamma-matrix calculation in FPGA in a 
device for multi-user detection in wireless communications systems; 

• methods and apparatus for equalizing load of R-matrix-element calculation
among multiple processors in a device for multi-user detection in wireless 
communications systems; and 

• methods and apparatus for use of Altivec registers and instruction set in 
performing MUD calculations in a wireless communications system. 

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 



EV 093 931 868 US 

Detailed Description of the Invention 



(see attached materials) 

1 Purpose

The purpose of this document is to describe the software architecture of the MCW-1 board.
The MCW-1 application is a digital signal processing application that performs interference
cancellation for a cellular base station modem board.

The software project consists of 3 major parts:

• Support for the custom MCW-1 board being designed by the Wireless Communications Group
hardware department. This consists of porting the existing host (VxWorks) and multicomputer
(MC/OS) software to the board, and adding code to support specialized features of the board
such as LED control, voltage monitoring, hardware watchdogs, etc.

• Increasing the MTBF of the system by addition of high availability software. This software
includes monitoring features such as watchdogs, fault detection/repair algorithms, and remote
software download.

• Implementation of the application software. This includes optimal implementation of the
MUD algorithms, as well as implementing degraded versions of the algorithm that can be
executed when some of the computational hardware is unavailable due to failures.

Detailed information on the design of new software for the MCW-1 board can be found in the
appropriate functional design documents, which are listed in the References section of this
document.

2 Glossary

1. MTBF - Mean Time Between Failures
2. MUD - Multi User Detection. A class of algorithms to detect multiple interference sources
and remove those effects from the signal.
3. Multicomputer - a parallel computer which achieves its increase in performance by having
more than one CPU working on the application simultaneously.
4. VxWorks - a proprietary real time operating system sold by Wind River, Inc.

3 Application Execution Environment

3.1 Overview

The purpose of the MUD application is to input raw antenna data from the base station modem
card, detect sources of interference, produce a new stream of data which has had interference
removed, and then output the data to the modem card for further processing.

Characteristics of this processing are that it must have low latency (< 300 microseconds),
must deal with large amounts of data (> 110 million bytes of data per second), and must be
very reliable.

The Mercury computer system is well suited to this kind of signal processing, exhibiting both
very low latencies and high bandwidths.
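
As a rough sizing aid, the two limits quoted above can be combined to estimate how much input arrives during one latency window. The constant and function names below are illustrative, and the result is only a lower bound, since the text gives the data rate as a minimum.

```python
# Rough sizing from the stated limits (figures from the text; names are illustrative).
LATENCY_LIMIT_S = 300e-6        # "< 300 microseconds"
INPUT_RATE_BYTES_PER_S = 110e6  # "> 110 million bytes of data per second"

def bytes_per_latency_window(rate=INPUT_RATE_BYTES_PER_S, latency=LATENCY_LIMIT_S):
    """Bytes that arrive during one latency window at the given input rate."""
    return rate * latency
```

At these figures roughly 33,000 bytes arrive within a single 300 microsecond window, which gives a feel for the buffer sizes the data path has to move per processing interval.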

The system hardware and software were not designed with high availability as a goal, so
reliability is in line with other standard computer systems designed for commercial
applications.

Input data flows from the Modem Motherboard, over the PCI bus, through the PXB++ bridge, onto
the fabric, through the crossbar, and into the memory of the computing elements. Output data
flows in the opposite direction. Some data will also flow between the 8240 Host CPU and the
compute elements, via a similar pathway, i.e. from the PCI bus through the PXB++ and thus
onto the fabric.

Although the software tries to treat the system as if the hardware were symmetric, as can be
seen in the following figure, the host 8240 CPU is attached via the PCI bus, not directly to
the fabric.

Figure 1

3.2 Operating System

MC/OS was selected as the operating system for the MCW-1 board because it provides the low
latencies and high I/O and IPC bandwidths required for these sorts of algorithms, and also
because it already provides support for most of the hardware being incorporated on the MCW-1
board.

The MUD application can be kept as portable as possible by minimizing the use of non-POSIX
MC/OS system calls, and encapsulating calls into proprietary MC/OS interfaces such as DX.

MC/OS requires the presence of a host computer system, which in this case will be a Motorola
8240 PowerPC processor running the VxWorks operating system.

3.3 IPC



The MC/OS DX subsystem will be used for IPC within the application. This 
API provides low overhead, low latency access to the Mercury DMA engines, 
which in turn provide high bandwidth transfers of data. DX will be used to move 
data between the G4 compute elements during parallel processing, and also will 
be used to move data between the MC/OS compute elements, the VxWorks host 
computer, and the motherboard modem card. 



3.4 I/O



Input / Output between the MUD card and the motherboard modem card takes 
place by moving data between the Race++ Fabric and the PCI bus via the PXB++ 
bridge. The application will use DX to initialize the PXB++ bridge, and to cause 
input/output data to move as if it were regular DX IPC traffic. 

Discussions with the customer need to take place in order to determine exactly 
how data flows over the PCI bus. For instance, it is currently unclear who will 
initiate data transfers, and how the initiator will know which PCI addresses should 
be involved in the transfer. A number of meetings with the customer are required 
to resolve these issues. 

3.5 High Availability

The approach to high availability on the MCW-1 card is to do most of the high availability
processing at a time when the application is not running. Specifically, faults are handled by
rebooting the system (fairly quickly). When the system comes up, the application can
determine which processing resources are available, and it is up to the application to
determine how to map its processing needs onto the available resources.

This approach to high availability means that there are short interruptions in service, but
that the application does not need to know how to continue execution across faults. For
instance, the application can make the assumption that the hardware configuration will not
change without the system first rebooting.

If the application has state which needs to be preserved across reboots, the application is
responsible for checkpointing the data on a regular basis. The system software will provide
an API to a portion of the non-volatile RAM for this purpose. It should be noted that the
non-volatile RAM is quite small, and that storage of more than a few hundred bytes of data
will require another mechanism to be put in place.
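
The checkpointing responsibility described above can be sketched as a small self-validating record. This is a hypothetical layout, not the system API: the 256-byte budget, the record format (sequence number, length, payload, CRC-32) and the function names are all assumptions made for illustration. The CRC lets the restarted application reject a record that was only partly written when the fault occurred.

```python
import struct
import zlib

NVRAM_BUDGET = 256  # hypothetical size of the region offered to the application

def pack_checkpoint(sequence, payload):
    """Build a record: sequence number, payload length, payload, CRC-32."""
    header = struct.pack("<IH", sequence, len(payload))
    body = header + payload
    record = body + struct.pack("<I", zlib.crc32(body))
    if len(record) > NVRAM_BUDGET:
        raise ValueError("checkpoint does not fit in the NVRAM region")
    return record

def unpack_checkpoint(record):
    """Return (sequence, payload), or None if the record is torn or corrupt."""
    if len(record) < 10:
        return None
    body, (crc,) = record[:-4], struct.unpack("<I", record[-4:])
    if zlib.crc32(body) != crc:
        return None
    sequence, length = struct.unpack("<IH", body[:6])
    if length != len(body) - 6:
        return None
    return sequence, body[6:]
```

Rejecting oversized records at pack time reflects the constraint noted in the text: anything beyond a few hundred bytes needs a different storage mechanism.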

4 Operating System Environment

4.1 Overview

Mercury Computer Systems, Inc. has historically had the concept of a host computer system.
This dates back to the days when Mercury produced array processors that were attached to
customers' mainframe computers. The evolution of Mercury multicomputers has left a vestigial
host that often performs little more service than as a bootstrap device for the
multicomputer.

The host computer system survives in the MCW-1 design primarily as a way to reduce schedule
risk. The existence of a host computer system is assumed in so many ways by the existing
Mercury software that it would add significant schedule risk to attempt to remove this
assumption in the MCW-1 timeframe.

In the MCW-1 board, the host system performs the following functions:

• It configures the Compute Elements, Fabric, and Bridges
• It loads executable code into the Compute Elements
• It serves as a bridge to the TCP/IP internetwork
• It serves as a file system daemon
• It runs some of the application software
• It manages some of the specialized high availability hardware

4.2 Bootstrap

The host computer system is based on a Motorola 8240 PowerPC processor on the MCW-1 board.
The 8240 is attached to an amount of linear flash memory. This flash memory serves several
purposes.

The first purpose the flash memory serves is as a source of instructions to execute when the
8240 comes out of reset. Linear flash is flash which can be addressed as if it were normal
RAM. Flash memories can also be organized to look like disk controllers; however, in that
configuration they require a disk driver to provide access to the flash memory. Although
such an organization has several benefits, such as automatic reallocation of bad flash cells
and write wear leveling, it is not appropriate for initial bootstrap.

The flash memory also serves as a file system for the host (see Section 4.6), and as a place
to store board permanent information (such as a serial number). Refer to the function design
specification (TBS) for more details on how flash memory is used.

When the 8240 first comes out of reset, memory is not turned on. Since high level languages
such as C assume some memory is present (for a stack, for instance), the initial bootstrap
code must be coded in assembler. This assembler bootstrap should only be a few hundred lines
of code, sufficient to configure the memory controller, initialize memory, and initialize
the configuration of the 8240 internal registers.

After the assembler bootstrap has finished execution, control is passed to the MCW-1 H.A.
code (which is also contained in boot flash memory). The purpose of the H.A. code is to
attempt to configure the fabric, and load the compute element CPUs with H.A. code. Once this
is complete, all the processors participate in the H.A. algorithm. The output of the
algorithm is a configuration table which details which hardware is operational and which
hardware is not. This is an input to the next stage of bootstrap, the Multicomputer
Configuration.

4.3 Multicomputer Configuration

MC/OS expects the host computer system to configure the multicomputer. The configmc program
reads a textual description of the computer system configuration, and produces a series of
binary data structures that describe the computer system configuration. These data
structures are used in MC/OS to describe the routing and configuration of the multicomputer.

The MCW-1 board will use almost exactly the same sequence to configure the multicomputer.
The major difference is that MC/OS expects configurations to be totally static, whereas the
MCW-1 configuration will need to change dynamically as faulty hardware causes various
resources to be unavailable for use.

There are currently two proposals being considered for how this dynamic reconfiguration
takes place.

The first proposal is that the binary data structures produced by configmc are modified to
include flags that indicate whether a piece of hardware is usable or not. A modification to
MC/OS would prevent it from using hardware marked as broken. The risk here is that the
modifications to MC/OS may be non-trivial. The benefit may be faster reboot times.

The second proposal is that the output of the H.A. algorithm is used to produce a new
configuration file input to configmc, the configmc execution is repeated with the new file,
and MC/OS is configured and loaded with no knowledge of the broken hardware whatsoever. This
proposal has the added benefit that configmc may be able to calculate optimal routing tables
in the face of failed hardware, minimizing the performance impact of the failure on the
remaining components. This proposal provides risk reduction given that MC/OS changes would
not be required.
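
The second proposal amounts to filtering the textual configuration against the table produced by the H.A. algorithm before configmc is run again. The sketch below illustrates that idea only. The actual configmc input format is not described here, so the one-entry-per-line `node <name> ...` syntax, the table representation (a name-to-boolean mapping) and the function name are all invented for the example.

```python
# Sketch of the second proposal: drop failed hardware from the textual
# configuration before it is handed to configmc again. The one-entry-per-line
# "node <name> ..." format is invented for this example.

def filter_configuration(config_text, operational):
    """Keep non-node lines, and node lines whose name is marked operational."""
    kept = []
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "node":
            if not operational.get(fields[1], False):
                continue  # hardware reported broken (or unknown): leave it out
        kept.append(line)
    return "\n".join(kept)
```

Treating an entry that is missing from the table as broken is a conservative choice made for the sketch: hardware is only configured when the H.A. algorithm has positively reported it as working.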

4.4 Multicomputer Loading

After the host computer has configured the multicomputer, the runmc program loads the
functional compute elements with a copy of MC/OS. The only change required for the MCW-1
board is for the loading process to examine which hardware may be offline because it is
faulty, and take this into account when determining which compute elements need to be
loaded.

4.5 TCP/IP Bridge

We believe that the customer is likely to require access to the MCW-1 board from a TCP/IP
network. MC/OS nodes do not contain a TCP/IP stack; therefore the host computer system acts
as a connection to the TCP/IP network. The VxWorks operating system contains a fully
functional TCP/IP stack. All currently envisioned daemons that need access to the TCP/IP
network will run on the host processor. Should the need arise for compute elements to access
network resources, the host computer would have to act as a proxy, exchanging information
with the compute element utilizing DX transfers, and then making the appropriate TCP/IP
calls on behalf of the compute element.

4.6 File System

The host computer system needs a file system to store configuration files, executable
programs, and MC/OS images. Rotating disks have insufficient MTBF; therefore flash memory
will be utilized. Rather than have a separate flash memory from the host computer boot
flash, the same flash is utilized for both bootstrap purposes and for holding file system
data. A commercial flash file system will be purchased and ported which provides DOS file
system semantics as well as write wear leveling. Wear leveling attempts to spread the number
of writes evenly across the sectors of flash memory, as flash memory can only be written a
finite number of times before it is worn out. Modern flash devices can be written around
100,000 times before they are worn out.
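
The wear-leveling idea can be shown in a few lines. This is a deliberately naive sketch, not the behavior of the commercial flash file system, which is not described here: it tracks a write count per sector and always directs the next write to the least-written sector that is still under the endurance figure quoted above. The class and method names are hypothetical.

```python
# Minimal wear-leveling sketch: always write to the least-worn usable sector.
ENDURANCE = 100_000  # write cycles per sector, the figure quoted in the text

class WearLeveler:
    def __init__(self, sector_count):
        self.writes = [0] * sector_count

    def pick_sector(self):
        """Return the index of the usable sector with the fewest writes so far."""
        candidates = [i for i, n in enumerate(self.writes) if n < ENDURANCE]
        if not candidates:
            raise RuntimeError("all sectors have reached their endurance limit")
        return min(candidates, key=lambda i: self.writes[i])

    def write(self):
        """Record one write and return the sector that received it."""
        sector = self.pick_sector()
        self.writes[sector] += 1
        return sector
```

With writes spread this way, the usable lifetime scales with the number of sectors instead of being limited by the single most frequently rewritten one.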

4.7 Remote Software Upgrade

The current design of the MCW-1 board assumes that the customer will want to update system
and application code in the field, via network. There are two portions of code which need to
be updated - the bootstrap code which is executed by the 8240 processor when it comes out of
reset, and the rest of the code which resides on the flash file system as files.

When code is initially downloaded to the MCW-1, it is written as a group of files within a
directory in the flash file system. A single top level file keeps track of which directory
tree is used to boot the system. This file continues to point at the existing directory tree
until a download of new software is successfully completed. When a download has been
completed and verified, the top-level file is updated to point to the new directory tree,
the boot flash is rewritten, and the system can be rebooted.

A possible problem in multi-board systems is how to deal with different versions of released
software on different boards. For instance, if board 1 has revision 1.0 of the software
distribution, and board 2 has revision 1.1 of the software distribution, will the two
versions work together, or will there be a way to ensure that the same version of software
is installed on all boards? This issue does not occur on the MCW-1 because it is a single
board solution; therefore this issue can be addressed at a later time.

A commercial solution to remote software upgrade is available, and has been ported to
VxWorks. It is our intent to port this code at a future date.
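
The pointer-file scheme described in this section can be sketched as "verify first, then switch". The fragment below is an illustration under assumptions, not the commercial upgrade package: SHA-256 digests stand in for whatever verification is actually used, the pointer file simply holds the path of the active tree, and the function names are hypothetical. Rewriting the boot flash and rebooting are outside the sketch.

```python
import hashlib
import os

def verify_tree(tree_dir, expected):
    """Check every expected file's SHA-256 digest inside the new directory tree."""
    for name, digest in expected.items():
        path = os.path.join(tree_dir, name)
        if not os.path.isfile(path):
            return False
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                return False
    return True

def activate_tree(pointer_file, tree_dir, expected):
    """Point the top-level file at tree_dir only after verification succeeds."""
    if not verify_tree(tree_dir, expected):
        return False  # pointer keeps naming the old tree
    tmp = pointer_file + ".tmp"
    with open(tmp, "w") as f:
        f.write(tree_dir)
    os.replace(tmp, pointer_file)  # atomic rename on POSIX file systems
    return True
```

Writing the new pointer to a temporary file and renaming it keeps the switch all-or-nothing on file systems with atomic rename. Whether a flash file system with DOS semantics offers the same guarantee would need to be checked separately.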

5 High Availability

5.1 Goals

The goal of the high availability features of the MCW-1 is to increase the MTBF of the
system as much as possible with little or no increase in cost to the board. The requirement
for minimal cost increase rules out such common approaches as hot or cold standby,
replicated hardware, etc.

It is not a goal to provide uninterrupted computing during hardware or software failures,
nor is it a goal to provide fault tolerance.

5.2 Fault Detection & Isolation

Fault detection is performed by having each CPU in the system gather as much information as
possible about what it observed during a fault, and then comparing the information in order
to detect which components could be the common cause of the symptoms. In some cases, it may
take multiple faults before the algorithm can detect which component is at fault. The
requirement not to add expensive hardware for fault detection means that in many cases the
algorithm will not be able to determine which component is at fault.

The MCW-1 board has many single points of failure. Specifically, everything on the board is
a single point of failure except for the compute elements. This means that the only hard
failures that can be configured out are failures in the compute elements. However, many
failures are transient or soft, and these can be recovered from with a reboot cycle.
Therefore, we expect the high availability features to have a positive effect on the MTBF of
the card.

More detailed information is available in the functional design specification (1).
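
The comparison step can be pictured as a set intersection. The sketch below is a hypothetical simplification and not the algorithm of the functional design specification: each CPU's observations are assumed to have been reduced to a set of components that could explain them, and the component names and function names are invented. Suspects common to every report are kept, and the candidate set is narrowed across successive faults.

```python
# Sketch: each CPU reports the set of components that could explain what it
# observed; components common to every report are the prime suspects.

def common_suspects(reports):
    """Intersect the suspect sets of all CPUs that reported symptoms."""
    suspect_sets = [set(s) for s in reports.values() if s]
    if not suspect_sets:
        return set()
    return set.intersection(*suspect_sets)

def isolate(fault_history):
    """Narrow the suspects across successive faults; None if still ambiguous."""
    remaining = None
    for reports in fault_history:
        suspects = common_suspects(reports)
        remaining = suspects if remaining is None else remaining & suspects
    if remaining is not None and len(remaining) == 1:
        return next(iter(remaining))
    return None
```

In the example used for testing, a first fault leaves two candidates, so nothing can be configured out; a second fault with different observers narrows the set to one compute element. This mirrors the remark that several faults may be needed before a component can be identified.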

5.3 Degraded Application

In the case of hard failures of a compute element, the application will have to execute with
reduced demand for computing resources. There are several strategies possible for the MUD
algorithm to decrease computing demands, such as working with a smaller number of
interference sources, or performing a less complete job of interference cancellation.

We expect the computing requirements of the algorithm to be high enough that failure of more
than a single compute element will cause the board to be inoperative. Therefore, the MCW-1
application only needs to handle two configurations: all compute elements functional and 1
compute element unavailable. We believe that a small amount of startup code can map the
application onto the two possible configurations. Note that the single crossbar means that
there are no issues as to which processes need to go on which processors - the bandwidth and
latencies for any node to any other node are identical on the MCW-1. This will not be true
of larger systems in the future, and we will eventually need a way to map computing and I/O
requirements onto arbitrary hardware configurations.
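
The "small amount of startup code" could look roughly like the following. This is a sketch under assumptions: the count of four compute elements is taken from the board's functional specification, while the mode names, task names and round-robin placement are invented for illustration.

```python
TOTAL_COMPUTE_ELEMENTS = 4  # the MCW-1 carries four compute elements

def select_configuration(available):
    """Map the number of working compute elements to an application mode."""
    if available == TOTAL_COMPUTE_ELEMENTS:
        return "full"
    if available == TOTAL_COMPUTE_ELEMENTS - 1:
        return "degraded"
    return "inoperative"

def assign_tasks(tasks, nodes):
    """Round-robin tasks over nodes; placement is free because of the single crossbar."""
    mapping = {node: [] for node in nodes}
    for i, task in enumerate(tasks):
        mapping[nodes[i % len(nodes)]].append(task)
    return mapping
```

Round-robin placement is adequate here only because every node is equidistant from every other node through the single crossbar. On a larger fabric the assignment would have to account for route length and link sharing.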

5.4 Remote Software Upgrade

Downtime due to the updating of software is counted against the availability of a computer
system, and therefore a remote reload of software is a necessity. The MCW-1 is capable of
downloading new software during normal operation. The reboot strategy means that the
downtime due to starting up new software is only a few seconds.

Referenced Documents

1. "MC/OS High Availability Functional Design Specification", Yevgeniy Tarashchanskiy,
17 April, 2000.

3 MERCURY PART NUMBER

The board identifier name is MCW-1a and the Mercury part number is 560549.

4 FUNCTIONAL DESCRIPTION
4.1 OVERVIEW

The MCW-1a is designed to be an algorithm processing daughter card utilizing the MPC7400 PPC, MPC8240,
PCE133 ASIC, XBAR++ ASIC, and PXB++ FPGA. The MCW-1 mates with a Motorola base station modem board.
The MCW-1a can provide additional connectivity between processing elements in different sector slots utilizing
over-the-top RACEway++ cables. It is a Motorola form factor card with four computational nodes and one host node.
The computational nodes (CNs) are based on the latest MPC7400 PPC microprocessor and the host is an MPC8240. The
MCW-1a can provide one Ethernet 10/100 BT port on the front panel. A 32-bit, 66 MHz PCI interface provides the
interface to the Motorola board.

The MCW-1a block diagram is shown in Figure 1.



MCW-1a Functional Specification



Created on 2/2/01 

Mercury Computer Systems, Inc. 



COMPANY CONFIDENTIAL 



[Figure 1, block diagram. Labeled blocks: 2.5 V, 3.3 V, Vcore1 and Vcore2 supplies; power
status/control; front panel RACE++; XBAR; PXB++; Ethernet 10/100BT; 64/33 to 32/66 PCI/PCI
bridge; modem connector; MPC8240 host with SDRAM, FLASH, NVRAM, RTC, I2C and HA registers;
four CEs, each with MPC7400, L2 cache, PCE133 ASIC, SDRAM, NVRAM, registers and JTAG.]

Figure 1. MCW-1a BLOCK DIAGRAM

Figure 2 shows the MCW-1a system topology. Table 1 gives the proposed route codes for the board.

Figure 2. MCW-1a BOARD-LEVEL TOPOLOGY

Table 1. Route Codes for MCW-1a Board XBAR

Route Code | Destination for Virtual Ports | Physical XBAR 1 Ports
0          |                               |
1          |                               |
2          |                               |
3          |                               |
4          |                               |
5          |                               |
6          |                               |
7          |                               |
8          |                               |
9          |                               |
10         |                               |
11         |                               |
12         |                               |
13         |                               |
14         |                               |
15         |                               |





4.2 FEATURES 

• Custom size daughter card.

• Master PCI 32-bit @ 66 MHz compliant with REV2.2 PCI local bus spec.
    PCI write peak performance is 240 MB/sec.
    PCI read peak performance is 220 MB/sec.

• Single IEEE 802.3 compliant Ethernet 10BASE-T/100BASE-T.

• Four computation nodes (CNs) based on MPC7400 PPC running @ 400 MHz.
    1 MB L2 cache per CN @ 200 MHz to 266 MHz.
    128 MB SDRAM with ECC per CN @ 133 MHz.
    Hardware based watchdog monitor.
    One PCE133 ASIC per CN.

• Two over-the-top, 66 MHz RACEway++ interlink ports configured in cable mode.

• PCI interface 32-bit @ 66 MHz.

• RACEway++ crossbar to connect nodes.

• PXB++ 64-bit @ 33 MHz PCI bus.

• Non-transparent 64-bit/33 MHz to 32-bit/66 MHz PCI bridge.

• 200 MHz PPC8240 PowerPC processor.
    32-bit 33 MHz PCI bus.
    100 MHz, 64 Mbytes SDRAM.

• Bulk FLASH interface.
    Linear address mode.
    32 banks of 1 Mbytes.

• LEDs.

• 8 Kbytes non-volatile SRAM.

• Real time clock.

• Compute node fault isolation control.

• JTAG test port.
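
The peak figures in the list above can be set against the theoretical limit of each PCI segment, which is simply bus width times clock rate. The helper below is an illustrative calculation using nominal clock values; the names are not taken from the specification.

```python
def peak_mb_per_s(width_bits, clock_mhz):
    """Theoretical peak: bytes per transfer times millions of transfers per second."""
    return (width_bits // 8) * clock_mhz

def efficiency(quoted_mb_per_s, width_bits, clock_mhz):
    """Fraction of the theoretical peak that a quoted figure represents."""
    return quoted_mb_per_s / peak_mb_per_s(width_bits, clock_mhz)

PCI_MODEM_SIDE = peak_mb_per_s(32, 66)  # 32-bit @ 66 MHz segment
PCI_PXB_SIDE = peak_mb_per_s(64, 33)    # 64-bit @ 33 MHz segment
```

Both segments on either side of the non-transparent bridge come out at the same nominal 264 MB/sec, so neither side is the narrower one in theory. The quoted 240 MB/sec write peak is about 91% of that limit.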

4.3 CONFIGURATION OPTIONS

4.3.1 CPU Options

• MPC7400 @ 400 MHz.
• MPC7410 @ 400 MHz.

4.3.2 SDRAM Options

• 128 MB SDRAM @ 133 MHz with ECC.

4.3.3 FLASH Memory Options

• 16 MB FLASH memory.

• 32 MB FLASH memory.

4.3.4 Ethernet Options

• No Ethernet.

• Single Ethernet.

4.4 REQUIREMENTS

4.4.1 Mechanical Form Factor

The MCW-1a form factor conforms to TBD Motorola mechanical requirements.

... core of MPC7400 is converted from +5.0V on the board. There are two core supplies used to power the four CPU cores.
The 2.5V voltage required is converted from +5.0V by an onboard power supply. The 3.3V voltage required is also
converted from +5.0V by an onboard power supply. The MCW-1a estimated typical power dissipation is 50 watts @
5.0V.

The MCW-1a provides a PCI 32-bit, 66 MHz interface to the Motorola modem board via an 80-pin connector.

The MCW-1a provides two over-the-top RACEway++ ports via two connectors located on the front panel.

The MCW-1a provides the single Ethernet 10/100 BT interface available from one RJ-45 connector. The Ethernet
interface is provided by a third party Ethernet-to-PCI interface controller chip that is bridged to the crossbar
RACEway++ port by means of a PXB++ FPGA (See Figure 2).

4.4.4 Functional 

1. Shall have the main SDRAM memory at 133MHz or greater.

2. Shall have a 1 Mbyte L2 Cache at 200MHz or greater. 

3. All CE nodes shall have 128Mbyte of SDRAM. 

4. Host node shall have at least 32Mbytes of nonvolatile memory. 

Form factor requirements: 

5. Shall be a daughter card that is % of a Motorola proprietary form factor modem payload card sized 11" by 14", on
20mm centers board to board. {Actual shape, dimensions, etc. TBD via drawings from Motorola.}

6. Shall be electrically a PMC module, TBD from further discussions with customer.

7. Shall use P1, P2 for 32/66MHz PCI bus.

8. Shall have a maximum heat dissipation of 50W 

System requirements 

9. A minimum of 105Mbyte/sec from the modem payload module to the MCW-1a card shall be provided through the

10. MCW-1a card to Motorola Modem Payload module output bandwidth shall be at least 200kbyte/sec.

11. Shall provide 250Mbyte/sec between CEs, e.g. RACE++ at 66MHz, as a minimum.

12. Shall have non-volatile memory, for at least 32Mbytes of data. 

13. Shall support software upgrade from remote locations. 

4.5 COMPATIBILITY 

The MCW-1a board is a custom daughter card designed for the Motorola base station modem board.

The PCI bus and the PXB++ FPGA limit the RACEway++ to the PCI performance. Peak transfers of 240
MB/sec are achievable between the PXB++, PPC8240 and the non-transparent PCI Bridge. (See Figure 1)
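
For comparison with the 240 MB/sec figure, the theoretical ceiling of each PCI segment can be worked out from bus width and clock. This is only a sketch: the helper name is hypothetical, and nominal 33 MHz and 66 MHz clocks are assumed rather than the exact 33.33/66.66 MHz values.

```python
def pci_peak_mb_per_sec(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak: one full-width data phase on every clock."""
    return bus_width_bits / 8 * clock_mhz


# Local PCI bus behind the PXB++ (64-bit @ 33 MHz, from the feature list).
local_pci_peak = pci_peak_mb_per_sec(64, 33)
# Off-board PCI interface to the modem board (32-bit @ 66 MHz).
modem_pci_peak = pci_peak_mb_per_sec(32, 66)

print(local_pci_peak, modem_pci_peak)
```

Both segments have the same nominal ceiling, so the quoted 240 MB/sec peak sits a little below what either bus could carry with no protocol overhead.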

Data transfers of up to 266 MB/sec peak are supported for access from RACEway++ to/from the MPC7400 CE's local
SDRAM memory.

PCE133 ASIC-initiated DMA transfers run at optimum RACEway++ speeds approaching 266 MB/sec peak. Data can 
U ,r from a single DMA command transfer to/from the CN's ^.^™^ hm 
RACEway++. The DMA engine formats transfers across RACEway++ optimally using packets up to 2048 bytes.

The operating clock frequency of the PCE133 ASIC, SDRAM, and MPC7400 processor bus is 133 MHz. The
operating frequency for the RACEway++ is 66 MHz. The local PCI clock is used by the corresponding PXB++ FPGA
and does not exceed 33 MHz.

A separate 25 MHz oscillator is included on the MCW-1a for driving the Ethernet interface.

4.7 DETAILED DESCRIPTION 

The MCW-1a block diagram is shown in Figure 1.

4.7.1 Modem Board Interface 

TBD (PCI 32-bit 66MHz). 
TBD PCI to PCI bridge stuff. 
TBD Motorola requirements. 



4.7.2 Board Resets 

There are several sources of reset to the daughter card. A MAX823 voltage supervisor will generate a 200ms
reset after VCC rises above 4.38 volts. When the MAX823 reset is deasserted, state machine logic will [illegible].
The state machine will continue driving RESET_0 until both the MAX823 and
PCI RESET_0 are deasserted. Either reset will generate the signal RESET_0, which will reset the card into its
power-on state. RESET_0 will also generate the HRESET_0 and TRST signals to the [illegible]. HRESET
and TRST for each of the CPUs can also be generated by their JTAG ports, JTAG HRESET and
JTAG TRST respectively. The MPC8240 is capable of generating a reset request, a soft reset (C_SRESET_0)
to each CPU, a checkstop request, and a CE ASIC reset (CE_RESET_0) to each of the four CE ASICs. A
discrete from the 5V powered reset PLD will generate the signal NPORESETJ (not a power on reset). This
signal [illegible] the MPC8240's discrete input word. The MPC8240 will read this signal as a logic low only if
coming out of reset due to either a power condition or an external reset from [illegible]
as the MPC8240 may request a board level reset. These requests are majority voted, and the result
RESET_VOTE_0 will generate a board level reset.

Figure 3 shows the MCW-1a hard reset generation function.

Figure 3. HARD RESET FUNCTIONAL BLOCK DIAGRAM

4.7.3 Watchdog Monitor

There are five independent watchdog monitors on the MCW-1a card. Each processor node is responsible for
strobing its watchdog once every 20 msec (initial window after board level reset is 2 sec) but no sooner than
500 usec. Strobing the watchdog for the processing nodes is accomplished by writing a zero/one sequence to
the DIAG3 discrete coming from the PCE133 ASIC. The MPC8240's watchdog is serviced by writing to
the memory mapped discrete location FFFF_D027. A single write of any value will strobe the watchdog. Upon
power-on, the watchdogs come up in a failed state; once a valid strobe is issued, the watchdog will be satisfied.
If the CPU fails to service the watchdog within the valid window, the watchdog will fail. A watchdog of a
failing processing node will trigger an interrupt to the MPC8240. An MPC8240 watchdog fault will trigger a
reset to the board. The watchdog will then remain in a latched failed state until a CPU reset occurs followed by
a valid service sequence. Figure 4 shows valid service sequences of the watchdog.

Timing diagram traces: Reset, Software service. Windows shown: 2 seconds (initial window after reset), then 500 usec / 20 msec between services.

Watchdog fault conditions:

- A service within 500 usec of last service.
- No service within 20 msec of last service.

Figure 4. EXAMPLE WATCHDOG SERVICE SEQUENCES
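
The service window described above can be sketched as a small checker. This is an illustration only, not the hardware logic: the function and constant names are hypothetical, boundary values (exactly 500 usec or exactly 20 msec) are assumed to be valid, and the 500 usec lower bound is assumed not to apply to the first strobe after reset.

```python
MIN_GAP_US = 500                 # a strobe sooner than this after the last one is a fault
MAX_GAP_US = 20_000              # no strobe within 20 msec of the last one is a fault
INITIAL_WINDOW_US = 2_000_000    # first strobe after board level reset: 2 sec window


def first_fault(strobe_times_us):
    """Return the index of the first strobe that violates the window, or None.

    strobe_times_us: ascending strobe times in microseconds since reset release.
    """
    last = None
    for i, t in enumerate(strobe_times_us):
        if last is None:
            # First strobe only has to arrive inside the initial window (assumption).
            if t > INITIAL_WINDOW_US:
                return i
        else:
            gap = t - last
            if gap < MIN_GAP_US or gap > MAX_GAP_US:
                return i
        last = t
    return None
```

For example, strobes at 1 msec, 11 msec, and 21 msec after reset are all valid, while a second strobe 200 usec after the first would be flagged as too early.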



4.7.4 Operating Frequency 

The MPC7400 bus runs at 133 MHz. The L2 cache bus of the MPC7400 runs at 200 MHz to 266 MHz. The SDRAMs
run at 133 MHz. The RACEway++ interface runs at 66 MHz. The local PCI bus runs at 33 MHz and the off-board PCI
runs at 66 MHz. The MPC8240's internal frequency is 200 MHz while its SDRAM interface is 100 MHz.

4.7.4.1 Clock Margining 

This card has two crystal oscillators for the three clock domains present on the card. A 66 MHz oscillator clocks the
RACEway++ interface and MPC7400 CNs; the 66 MHz frequency is divided in half to generate a 33 MHz signal for
the PCI interface. A second oscillator, 25 MHz, clocks the Ethernet and watchdog circuitry. Both the PCI and MPC
clocks are marginable. In order to provide clock margining, a 4-pin connector allows the test engineer to functionally
disable the onboard oscillator and replace it with a test frequency. The pinout of this connector is detailed in Table 2.

Table 2. Test Clock Connector



Pin    Signal
1      GND
2      /Test Clock
3      Test Clock
4      Test Clock Enable L



4.7.5 Serial Configuration EEPROM 

There are several serial EEPROMs used to load configuration to the CE ASICs, PXB++ and XBAR++ after reset. The
serial EEPROM functionality can be found in the ASIC's functional specification.

4.7.5.1 CE ASIC Serial EEPROM 

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during
manufacture of the MCW-1a to contain configuration information for the CE ASIC. The serial EEPROM AT24C128 is
controlled from the CE ASIC. After reset, the CE ASIC automatically reads the first location from the serial EEPROM.
Refer to the CE ASIC functional specification, reference 3, for information on reading and writing this device.



4.7.5.2 PXB++ FPGA Serial EEPROM 

The serial EEPROM can be read and programmed by means of the PCI bus or the RACEway++ bus. It is programmed
during manufacture of the MCW-1a to contain configuration information for the PXB++. The serial EEPROM AT24C128
device is 128K bits and is controlled from the PXB++. After reset, the PXB++ automatically reads 8 KB from the serial
EEPROM and initializes the PXB++ internal registers. Refer to the PXB++ FPGA functional specification, reference 5,
for information on reading and writing this device.

4.7.5.3 XBAR++ ASIC Serial EEPROM 

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during
manufacture of the MCW-1a to contain configuration information for the XBAR++. The serial EEPROM AT24C128 is
controlled from the XBAR++ ASIC. After reset, the XBAR++ ASIC automatically reads from the serial EEPROM and
initializes the XBAR++ internal registers. Refer to the XBAR++ ASIC functional specification, reference 4, for
information on reading and writing this device.



4.7.5.3.1 Register Description 

Reference 4 describes the registers of the XBAR++ ASIC.

4.7.6 RACEway++ Interconnect 

Communication between all processing and I/O elements on the system card is provided by a Mercury eight-port
crossbar XBAR++ ASIC. The XBAR++ provides up to three simultaneous 266 MB/sec peak throughput data paths
between elements for a total peak throughput of 798 MB/sec. Three crossbar ports connect to the RapidIO Bridge
FPGA. Each MPC7400 CN uses one crossbar port. The Ethernet and MPC8240 interface to a crossbar port through the
PXB++. (See 0) Reference 4 describes the operation and registers of the XBAR++ ASIC.
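
The figures in this paragraph can be cross-checked with a few lines of arithmetic. The port tally below is one reading of the text (three RapidIO bridge ports, one port per CN, one port shared by Ethernet and the MPC8240 through the PXB++); the variable names are illustrative.

```python
PATH_PEAK_MB_PER_SEC = 266      # peak throughput of one crossbar data path
SIMULTANEOUS_PATHS = 3          # paths that can be active at the same time

total_peak = PATH_PEAK_MB_PER_SEC * SIMULTANEOUS_PATHS

# Crossbar port usage as described in the paragraph above.
port_usage = {
    "RapidIO Bridge FPGA": 3,
    "MPC7400 CNs": 4,            # four CNs, one port each
    "PXB++ (Ethernet and MPC8240)": 1,
}
ports_used = sum(port_usage.values())

print(total_peak, ports_used)
```

Under this reading all eight ports of the crossbar are accounted for, and the aggregate figure is simply three times the per-path peak.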

4.7.7 Local PCI I/O Bus 

The PXB++ FPGA provides the local PCI I/O bus. This bus is accessible by means of the RACEway++ from the
processing nodes. All resources on this bus are initialized and controlled by the MPC8240. This bus provides access to
an Ethernet controller, PCI to PCI transparent bridge and the PPC8240 host controller. Transfers from devices on this
local PCI bus to and from devices on the RACEway++ can achieve 240 MB/sec for writes and 220 MB/sec for reads.
These rates assume block transfers of reasonable size.

4.7.7.1 PXB++ Program EEPROM

The PXB++ FPGA is programmed by an XC18V04 configuration EEPROM running in parallel mode. Configuration
initiates when a power-on or board level reset occurs. The 16.6MHz configuration clock is generated by dividing the
onboard 33MHz clock. The configuration EEPROM itself is onboard programmable through the JTAG scan chain.



4.7.7.1.1 Register Description 

Reference 5 describes the registers of the PXB++ ASIC. 



4.7.8 Ethernet Interface 

The PCI-to-Ethernet interface uses the Am79C973 PCnet-FAST III single chip 10/100 Mbps Ethernet controller. This
device is equipped with a built in physical layer interface to achieve a minimal parts count Ethernet interface. A 25 MHz 
oscillator provides the proper clock frequency to the Ethernet chip. The PCI interrupt from the Ethernet chip is wired to 
the MPC8240's external interrupt controller. 



4.7.9 MPC7400 or Nitro Computer Nodes (CNs) 

The board contains four MPC7400 CNs. Each MPC CN uses a PCE133 ASIC to interface the cpu to RACEway++. The 
PCE133 ASIC provides all the standard features of a CN, such as a DMA engine, mail box interrupts, timers, 
RACEway++ page mapping registers, SDRAM interface, and so on. Local memory for each CN consists of 32, 64, or 
128 MB SDRAM, and L2 cache SRAM. Each CN also has a nonvolatile SRAM and watchdog monitor. The cpu bus is 
64-bit data, 32-bit address, and operates synchronously at 133 MHz.



4.7.9.1 Processor 

The MCW-1a card is designed to use either the 400 MHz MPC7400 or the 400 MHz Nitro processors. The processor is
packaged in a 25mm, 360-ball CBGA package. Each processor requires the attachment of a heat sink to keep it within
its thermal limits.



4.7.9.2 MPC7400 L2 Cache 

The MPC7400 L2 cache for each CN is composed of pipelined, single-cycle deselect, sync burst SRAM. This is 
implemented using two 64K, 128K, or 256K by 36-bit sync burst SRAM parts to make a 0.5 MB, 1 MB, or 2 MB 
cache. MPC7400 L2 cache can be depopulated to 0 MB. 
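
The cache sizes quoted above follow from the SRAM depth. A minimal sketch, assuming each 36-bit-wide part contributes 32 data bits and 4 parity bits (the split is an assumption, not stated in this section); the function name is illustrative.

```python
def l2_cache_bytes(depth_words: int, parts: int = 2, data_bits_per_part: int = 32) -> int:
    """Usable cache size for `parts` SRAMs of the given depth, parity bits excluded."""
    return depth_words * parts * data_bits_per_part // 8


K = 1024
MB = 1024 * 1024

for depth in (64 * K, 128 * K, 256 * K):
    print(depth // K, "K x 36:", l2_cache_bytes(depth) / MB, "MB")
```

Two parts side by side give a 64-bit data path, so doubling the SRAM depth doubles the cache size.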



4.7.9.3 PCE133 ASIC 

The MPC processor compute element ASIC (PCE133 ASIC) is a Mercury-designed component. It provides the
interface between the MPC7400, the synchronous DRAM, and the RACEway++. All the PCE133 features such as
DMA, mailbox interrupts, timers, address snooping, prefetch buffers, and so on, are available in this configuration. This
chip is provided in a 35mm, 388-ball BGA package. Reference 3 describes the operation and registers of the PCE133
ASIC.



4.7.9.3.1 Register Description 

Reference 3 describes the registers of the PCE133 ASIC. 



4.7.9.4 Address Map 



4.7.9.4.1 Master Address Map 

Transfers from the MPC7400 to the PCE133 ASIC and RACEway++ are address mapped as shown in Table 3.

SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked read/write and locked read
transactions are supported for all data sizes. The 16 Mbyte boot FLASH area is further divided in Table 4.

Table 3. Master Address Map



From Address    To Address      Function
0x0000 0000     0x0FFF FFFF     Local SDRAM 256 MB
0x1000 0000     0x1FFF FFFF     XBAR 256 MB map window 1
0x2000 0000     0x2FFF FFFF     XBAR 256 MB map window 2
0x3000 0000     0x3FFF FFFF     XBAR 256 MB map window 3
0x4000 0000     0x4FFF FFFF     XBAR 256 MB map window 4
0x5000 0000     0x5FFF FFFF     XBAR 256 MB map window 5
0x6000 0000     0x6FFF FFFF     XBAR 256 MB map window 6
0x7000 0000     0x7FFF FFFF     XBAR 256 MB map window 7
0x8000 0000     0x8FFF FFFF     XBAR 256 MB map window 8
0x9000 0000     0x9FFF FFFF     XBAR 256 MB map window 9
0xA000 0000     0xAFFF FFFF     XBAR 256 MB map window A
0xB000 0000     0xBFFF FFFF     XBAR 256 MB map window B
0xC000 0000     0xCFFF FFFF     XBAR 256 MB map window C
0xD000 0000     0xDFFF FFFF     XBAR 256 MB map window D
0xE000 0000     0xEFFF FFFF     XBAR 256 MB map window E
0xF000 0000     0xFBFF FBFF     Not used (CE reg replicated mapping)
0xFBFF FC00     0xFBFF FDFF     Internal CN ASIC registers
0xFBFF FE00     0xFEFF FFFF     Prefetch control
0xFF00 0000     0xFFFF FFFF     16MB boot FLASH memory area
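
As an illustration of how the master address map in Table 3 partitions the 32-bit space, the sketch below classifies an MPC7400 address by region. The function name and the returned labels are hypothetical; only the address boundaries come from the table.

```python
def decode_master_address(addr: int) -> str:
    """Classify a 32-bit MPC7400 address according to the Table 3 boundaries."""
    if not 0 <= addr <= 0xFFFF_FFFF:
        raise ValueError("address must fit in 32 bits")
    if addr <= 0x0FFF_FFFF:
        return "Local SDRAM"
    if addr <= 0xEFFF_FFFF:
        # Windows 1..E are selected by the top address nibble.
        return f"XBAR map window {addr >> 28:X}"
    if addr <= 0xFBFF_FBFF:
        return "Not used"
    if addr <= 0xFBFF_FDFF:
        return "Internal CN ASIC registers"
    if addr <= 0xFEFF_FFFF:
        return "Prefetch control"
    return "Boot FLASH"


print(decode_master_address(0xA000_0000))
```

Because each XBAR window is 256 MB, the window number is simply the top four address bits.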



Table 4. Boot FLASH Address Map 



From Address    To Address      Function
0xFF00 2006     0xFF00 2006     Software Fail Register
0xFF00 2005     0xFF00 2005     MPC8240 HA Register
0xFF00 2004     0xFF00 2004     Node 3 HA Register
0xFF00 2003     0xFF00 2003     Node 2 HA Register
0xFF00 2002     0xFF00 2002     Node 1 HA Register
0xFF00 2001     0xFF00 2001     Node 0 HA Register
0xFF00 2000     0xFF00 2000     Local HA Register (status/control)
0xFF00 0000     0xFF00 1FFF     NovRAM



4.7.9.4.2 Slave Address Map 

Slave accesses are defined as accesses initiated by an external RACEway++ device directed toward the MPC7400 CN.
The MPC is not accessible as a slave device. The SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked
read/write and locked read are supported for all data sizes. The PCE RACEway port supports a 256 MB address space
partitioned as follows in Table 5:

Table 5. Slave Address Map 



From Address    To Address      Function
0x0000 0000     0x0FFF FBFF     256 MB less 1 KB hole SDRAM
0x0FFF FC00     0x0FFF FFFF     PCE133 internal registers



Re^nce 12s the internal interrupt sources for the PCE133 ASIC. The external interrupt pin on the PCE133
ASIC is driven by the HA PLD and is currently not used. The interrupt output from the PCE133 ASIC is wired to the
CPU's external interrupt input pin.

4.7.9.6 PCE133 DIAG Bits

The DIAG3 signal is wired to the HA PLD and is used to strobe the node's hardware watchdog monitor. The UlAOi
signal is wired to the MPC8240's interrupt controller and is used, by the node, to generate a general purpose interrupt to
the MPC8240. The D1AGBIT signal is wired to the HA PLD and is currently not used.

The MPC7400's HRESET_0 input is driven by three sources gated together: the HRESET_0 pin on the PCE133 ASIC,
HRESET_0 from the JTAG connector, and HRESET_0 from the majority voter. The HRESET_0 pin from the CE ASIC
is set by the "node run" bit field (bit 0) of the PCE133 ASIC's Miscon_A register. Setting HRESET_0 low causes the
MPC7400 to be held in reset. HRESET_0 is low immediately after system reset or power-up; the MPC7400 is held in
reset until the HRESET_0 line is pulled high by setting the node run bit to 1. The JTAG HRESET_0 is controlled by
debugger software when a JTAG debugger module is connected to the card. The HRESET_0 from the majority voter is
generated by a majority vote from all healthy nodes to reset.

4.7.9.8 Boot Procedures 

When a CPU reset is asserted, the MPC7400 is put into reset state. The MPC7400 will remain in a reset state until the
RUN bit 0 of the Miscon_A register is set to 1 and the MPC8240 has released the reset signals in the discrete output
word. The RUN bit should be set to 1 after the boot code has been loaded into the SDRAM starting at location
0x0000_0100. The ASIC maps the reset vector 0xFFF0_0100 generated by the MPC7400 to address 0x0000_0100.
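
The reset-vector remapping can be pictured with a small model. This is a sketch under stated assumptions: only the single vector translation named in the text is modeled (whether the ASIC aliases a wider range is not stated here), and both helper names are hypothetical.

```python
RESET_VECTOR = 0xFFF0_0100   # address the MPC7400 fetches from after reset
BOOT_ENTRY = 0x0000_0100     # SDRAM address the ASIC substitutes for it


def translate_fetch(addr: int) -> int:
    """Model of the ASIC's reset-vector remap; other addresses pass through."""
    return BOOT_ENTRY if addr == RESET_VECTOR else addr


def boot_code_covers_entry(load_addr: int, image_len: int) -> bool:
    """True if a loaded image includes the boot entry, i.e. RUN may be set."""
    return load_addr <= BOOT_ENTRY < load_addr + image_len


print(hex(translate_fetch(RESET_VECTOR)))
```

A loader would check that the boot image is in place at the entry address before setting the RUN bit, since the first instruction fetch lands there.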

4.7.9.9 MPC7400 CN SDRAM

The main memory for each CN is composed of one bank of synchronous DRAM. This is implemented using five
K4S280832A-TC/L75 @133 MHz synchronous DRAM parts. As shown in the memory map (See Table 3), the main
memory begins at address 0x0 and grows upward in the address space as memory is increased. The PCE133 ASIC
supports error correction (ECC) on the SDRAM.

The SDRAM operates as zero wait state memory and can provide up to 1 GB/sec peak bandwidth on writes from 
MPC7400 and 800 MB/sec peak bandwidth on read from the MPC7400. ECC error correction is supported. 
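
The "1 GB/sec" write figure is consistent with a 64-bit data path clocked at 133 MHz. The sketch below assumes that bus width (taken from the CPU bus description in 4.7.9) and compares the quoted read rate against the same ceiling; the variable names are illustrative.

```python
BUS_BYTES = 8        # 64-bit data bus (assumed to apply to the SDRAM path)
CLOCK_MHZ = 133      # SDRAM and processor bus clock

peak_mb_per_sec = BUS_BYTES * CLOCK_MHZ        # zero-wait-state ceiling
read_fraction = 800 / peak_mb_per_sec          # quoted read rate vs ceiling

print(peak_mb_per_sec, round(read_fraction, 2))
```

Under these assumptions the read rate is roughly three quarters of the bus ceiling, while writes approach it.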

4.7.9.10 MPC7400 Non-Volatile RAM

Each node will be equipped with 8Kx8 of non-volatile RAM for the storage of fault record data and configuration
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the PCE133 ASIC's
boot FLASH interface. The data bus of the device is isolated from the PCE ASIC through an IDT IDTQS32244SO
buffer. This buffer provides loading isolation and 3.3V to 5V translation.

4.7.10 MPC8240 Host Controller

The MPC8240 integrated processor is comprised of a peripheral logic block and a 32-bit superscalar PowerPC
processor core. The peripheral logic integrates a PCI bridge, memory controller, DMA controller, EPIC interrupt
controller, a message unit (and I2O controller), and an I2C controller. The processor core is a full-featured,
high-performance processor with floating-point support, memory management, 16Kbytes instruction cache,
16Kbytes data cache, and power management features.

Major features of the MPC8240 are as follows:

Peripheral logic

- Memory interface
    High-bandwidth bus, 64-bit data bus, to SDRAM.
    ECC protected SDRAM.
    16 Mbytes of ROM space (32Mbytes paged).
    8-bit ROM.
    Write buffering for PCI and processor accesses.

- PCI interface
    32-bit PCI interface operating at 33 MHz (66 MHz capable).
    PCI 2.1-compatible.
    Support for accesses to all PCI address spaces.
    Selectable big- or little-endian operation.
    Store gathering of processor-to-PCI write and PCI-to-memory write accesses.
    PCI bus arbitration unit (five request/grant pairs).

- Two-channel integrated DMA controller
    Supports direct mode or chaining mode (automatic linking of DMA transfers).
    Supports scatter gathering: read or write discontinuous memory.
    Interrupt on completed segment, chain, and error.
    Local-to-local memory.
    PCI-to-PCI memory.
    PCI-to-local memory.
    Local-to-PCI memory.

- Message unit
    Two doorbell registers.
    Inbound and outbound messaging registers.
    I2O message controller.

- I2C controller with full master/slave support

- Embedded programmable interrupt controller (EPIC)
    Five hardware interrupts (IRQs) or 16 serial interrupts.
    Four programmable timers.

- Integrated PCI bus and SDRAM clock generation

- Programmable memory and PCI bus output drivers

- Debug features
    Memory attribute and PCI attribute signals.
    Debug address signals.
    MIV signal: marks valid address and data bus cycles on the memory bus.
    Error injection/capture on data path.
    IEEE 1149.1 (JTAG)/test interface.

Processor core

- High-performance, superscalar processor core
    Integer unit (IU).
    Floating-point unit (FPU) (user enabled or disabled).
    Load/store unit (LSU).
    System register unit (SRU).
    Branch processing unit (BPU).

- 16-Kbyte instruction cache

- 16-Kbyte data cache

- Lockable L1 cache: entire cache or on a per-way basis

- Dynamic power management

4.7.10.1 Address Map 

The MPC8240 in PCI host mode supports two address mapping configurations designated as address map A, and 
address map B. Address map A conforms to the PowerPC reference platform (PReP) specification. Address map B 
conforms to the PowerPC microprocessor common hardware reference platform (CHRP). Note that the support of map 
A is provided for backward compatibility only. It is strongly recommended that new designs use map B because map A 
may not be supported in future devices. 

Address map B complies with the PowerPC microprocessor common hardware reference platform (CHRP). The address 
space of map B is divided into four areas: system memory, PCI memory, PCI I/O, and system ROM space. When 
configured for map B, the MPC8240 translates addresses across the internal peripheral logic bus and the external PCI 
bus as shown in Table 6. 

Table 6. MPC8240 Address Map B 



Processor Core Address Range                       PCI Address Range        Definition
Hex (start, end)          Decimal (start, end)
0000_0000  0009_FFFF      0        640K-1          NO PCI CYCLE             System memory
000A_0000  000F_FFFF      640K     1M-1            000A_0000 - 000F_FFFF    Compatibility hole
0010_0000  3FFF_FFFF      1M       1G-1            NO PCI CYCLE             System memory
4000_0000  7FFF_FFFF      1G       2G-1            NO PCI CYCLE             Reserved
8000_0000  FCFF_FFFF      2G       4G-48M-1        8000_0000 - FCFF_FFFF    PCI memory
FD00_0000  FDFF_FFFF      4G-48M   4G-32M-1        0000_0000 - 00FF_FFFF    PCI/ISA memory
FE00_0000  FE7F_FFFF      4G-32M   4G-24M-1        0000_0000 - 007F_FFFF    PCI/ISA I/O
FE80_0000  FEBF_FFFF      4G-24M   4G-20M-1        0080_0000 - 00BF_FFFF    PCI I/O
FEC0_0000  FEDF_FFFF      4G-20M   4G-18M-1        CONFIG_ADDR              PCI configuration address
FEE0_0000  FEEF_FFFF      4G-18M   4G-17M-1        CONFIG_DATA              PCI configuration data
FEF0_0000  FEFF_FFFF      4G-17M   4G-16M-1        FEF0_0000 - FEFF_FFFF    PCI interrupt acknowledge
FF00_0000  FF7F_FFFF      4G-16M   4G-8M-1         FF00_0000 - FF7F_FFFF    32/64-bit FLASH/ROM (1)
FF80_0000  FFFF_FFFF      4G-8M    4G-1            FF80_0000 - FFFF_FFFF    8/32/64-bit FLASH/ROM (2)

Notes: 

(1) This bank of FLASH is not used.

(2) This bank of FLASH is configured in 8-bit mode and is further broken down in Table 7. 
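
A quick way to sanity-check the map B ranges in Table 6 is to encode them and confirm that they tile the 4 GB space without gaps or overlaps. In this sketch the region labels are abbreviated, the function names are illustrative, and the FF00_0000 start of the first FLASH/ROM bank is inferred from the "4G-16M" column rather than read directly from the table.

```python
MAP_B = [
    (0x0000_0000, 0x0009_FFFF, "System memory"),
    (0x000A_0000, 0x000F_FFFF, "Compatibility hole"),
    (0x0010_0000, 0x3FFF_FFFF, "System memory"),
    (0x4000_0000, 0x7FFF_FFFF, "Reserved"),
    (0x8000_0000, 0xFCFF_FFFF, "PCI memory"),
    (0xFD00_0000, 0xFDFF_FFFF, "PCI/ISA memory"),
    (0xFE00_0000, 0xFE7F_FFFF, "PCI/ISA I/O"),
    (0xFE80_0000, 0xFEBF_FFFF, "PCI I/O"),
    (0xFEC0_0000, 0xFEDF_FFFF, "PCI configuration address"),
    (0xFEE0_0000, 0xFEEF_FFFF, "PCI configuration data"),
    (0xFEF0_0000, 0xFEFF_FFFF, "PCI interrupt acknowledge"),
    (0xFF00_0000, 0xFF7F_FFFF, "FLASH/ROM bank (1), unused"),
    (0xFF80_0000, 0xFFFF_FFFF, "FLASH/ROM bank (2), 8-bit"),
]


def classify(addr: int) -> str:
    """Return the map B region containing a processor core address."""
    for start, end, name in MAP_B:
        if start <= addr <= end:
            return name
    raise ValueError("address outside the 32-bit space")


def tiles_full_space() -> bool:
    """True if the ranges are contiguous from 0 to 4G-1 with no overlap."""
    expected = 0
    for start, end, _ in MAP_B:
        if start != expected or end < start:
            return False
        expected = end + 1
    return expected == 1 << 32


print(classify(0xFFFF_D027), tiles_full_space())
```

The watchdog strobe location FFFF_D027 mentioned in 4.7.3 falls in the 8-bit FLASH/ROM bank, which is the bank broken down further in Table 7.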



Table 7. Port X Address Map

Bank Select      Processor Core Address Range    Definition
11111            FFE0_0000 - FFEF_FFFF           Accesses Bank 0
11110 - 00001    FFE0_0000 - FFEF_FFFF           Application code (1) (30 pages)
00000            [illegible] - FFEF_FFFF         Application/boot code (1) (2)
XXXX (3)         FFFF_D000 - FFFF_D000           Discrete input word 0
                 FFFF_D001 - FFFF_D001           Discrete input word 1
                 FFFF_D002 - FFFF_D002           Discrete output word 0
                 FFFF_D003 - FFFF_D003           Discrete output word 1
                 FFFF_D004 - FFFF_D004           Discrete output word 2
                 FFFF_D010 - FFFF_D010           IC (Pending interrupt)
                 FFFF_D011 - FFFF_D011           IC (Interrupt mask low)
                 FFFF_D012 - FFFF_D012           IC (Interrupt clear low)
                 FFFF_D013 - FFFF_D013           IC (Unmasked, pending low)
                 FFFF_D014 - FFFF_D014           IC (Interrupt input low)
                 FFFF_D015 - FFFF_D015           Unused (read FF)
                 FFFF_D016 - FFFF_D016           Unused (read FF)
                 FFFF_D017 - FFFF_D017           Unused (read FF)
                 FFFF_D018 - FFFF_D018           Unused (read FF)
                 FFFF_D019 - FFFF_D019           Unused (read FF)
                 FFFF_D020 - FFFF_D020           HA (Local HA register)
                 FFFF_D021 - FFFF_D021           HA (Node 0 HA register)
                 FFFF_D022 - FFFF_D022           HA (Node 1 HA register)
                 FFFF_D023 - FFFF_D023           HA (Node 2 HA register)
                 FFFF_D024 - FFFF_D024           HA (Node 3 HA register)
                 FFFF_D025 - FFFF_D025           HA (8240 HA register)
                 FFFF_D026 - FFFF_D026           HA (Software Fail)
                 FFFF_D027 - FFFF_D027           HA (Watchdog Strobe)
                 FFFF_D028 - FFFF_DFFF           4068 Bytes FLASH
                 FFFF_E000 - FFFF_FFFF           8K NOVRAM

Notes: 

(1) Thirty-one 1Mbyte blocks of application memory residing at address FFE0_0000 - FFEF_FFFF selected by the
FLASH page bits.

(2) 2Mbyte block available after reset. 

(3) Always available 



4.7.10.2 Register Description 

Reference 10 describes the registers of the MPC8240. 



4.7.10.3 Interrupt 

The MPC8240 contains an embedded programmable interrupt controller (EPIC) device. The EPIC implements the 
necessary functions to provide a flexible and general-purpose interrupt controller solution. The EPIC pools
hardware-generated interrupts from many sources, both within the MPC8240 and externally, and delivers them to the processor
core in a prioritized manner. The solution adopts the OpenPIC architecture (architecture developed jointly by AMD and
Cyrix for SMP interrupt solutions) and implements the logic and programming structures according to that specification.
The MPC8240's EPIC unit supports up to five external interrupts, four internal logic-driven interrupts and four timers
with interrupts. See Reference 10 for a detailed description of the EPIC unit.

MCW-1a Functional Specification. Created on 2/2/01. Mercury Computer Systems, Inc. COMPANY CONFIDENTIAL

The five external interrupt inputs to the EPIC are wired to the external interrupt controller PLD. 

4.7.10.4 MPC8240 Reset 

The MPC8240 can be reset from three sources: a board level reset (RESET_0), a JTAG controlled reset, or a failure in
its watchdog monitor. Any reset to the MPC8240 shall cause the discrete output registers to go to the reset (low) state; this, in
turn, will cause all G4 nodes to enter the reset state.

4.7.10.5 Boot Procedure 

After the release of reset to the MPC8240, it will begin executing code out of the FLASH memory. A reset will
automatically set the FLASHSEL(4:0) bits to all zeros; therefore, the MPC8240's boot code must reside in bank 0.
Once its application code is copied to SDRAM, the MPC8240 can then sequence through the FLASH banks by setting
the appropriate bits in the discrete output word. Application code for the G4 nodes resides in the remaining thirty-one
banks of FLASH.



4.7.11 Bulk FLASH Memory

There are 32Mbytes of bulk FLASH memory, comprised of two Intel 28F128J3 StrataFLASH memory devices. The
MPC8240's memory map limits the size of the 8-bit wide FLASH to 2Mbytes; this requires hardware to divide the
FLASH into thirty-two 1Mbyte banks. Five software-controlled discretes allow switching between banks. Accesses to
the 1Mbyte address range of FFE0_0000 through FFEF_FFFF will always access the first block of FLASH,
NOVRAM, Discrete I/O, HA registers, watchdog monitor, and the interrupt controller. Accesses to the 1Mbyte address
range of FFF0_0000 through FFFF_FFFF will access a page of memory in the FLASH. The actual page selected is
based on the five FLASH select bits, driven by the Discrete Output word.
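
Bank switching through the five FLASH select bits might look like the sketch below. Several points are assumptions rather than facts from this section: PowerPC bit numbering is assumed (bit 0 is the most significant bit, so FLASHSEL4..FLASHSEL0 in DH bits 3..7 occupy the low five bits of the byte), bank n is assumed to sit at byte offset n x 1 Mbyte in the linear FLASH array, and all names are hypothetical.

```python
FLASHSEL_MASK = 0x1F          # low five bits of discrete output word 1 (assumed layout)
BANK_SIZE = 0x10_0000         # 1 Mbyte per bank
BANK_COUNT = 32


def with_flash_bank(word1: int, bank: int) -> int:
    """Return discrete output word 1 with the bank select field replaced."""
    if not 0 <= bank < BANK_COUNT:
        raise ValueError("bank must be in 0..31")
    return (word1 & 0xFF & ~FLASHSEL_MASK) | bank


def flash_array_offset(bank: int, offset_in_bank: int) -> int:
    """Byte offset into the 32 Mbyte FLASH array (assumes linear bank order)."""
    if not 0 <= bank < BANK_COUNT or not 0 <= offset_in_bank < BANK_SIZE:
        raise ValueError("bank or offset out of range")
    return bank * BANK_SIZE + offset_in_bank


print(hex(with_flash_bank(0xE0, 5)), hex(flash_array_offset(5, 0x100)))
```

The read-modify-write in `with_flash_bank` preserves the other three bits of the word (wrap, I2C reset, and software LED), which share the same register.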

4.7.12 Real Time Clock 

The PCF8563 is a CMOS real-time clock/calendar optimized for low power consumption. A programmable clock
output, interrupt output and voltage-low detector are also provided. All addresses and data are transferred serially via a
two-line bidirectional I2C-bus. Maximum bus speed is 400 kbits/s.

Real Time Clock Features: 

- Provides year, month, day, weekday, hours, minutes and seconds 

(Based on an external 32.768 kHz quartz crystal) 

- Century flag 

- Wide operating supply voltage range: 1.0 to 5.5 V 

- Low back-up current; typical 0.25 µA at VDD = 3.0 V and Tamb = 25 °C

- 400 kHz two-wire I2C-bus interface (at VDD = 1.8 to 5.5 V)

- Programmable clock output for peripheral devices: 32.768 kHz, 1024 Hz, 32 Hz and 1 Hz 

- Alarm and timer functions 

- Voltage-low detector 

- Integrated oscillator capacitor 

- Internal power-on reset 

- I2C-bus slave address: read A3H; write A2H

- Open drain interrupt pin 
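
The read and write address bytes listed above are the same 7-bit I2C device address with the read/write flag in the least significant bit. A short sketch (the function name is illustrative):

```python
def split_address_byte(address_byte: int):
    """Split an I2C address byte into (7-bit device address, direction)."""
    direction = "read" if address_byte & 1 else "write"
    return address_byte >> 1, direction


print(split_address_byte(0xA3), split_address_byte(0xA2))
```

Both bytes resolve to the same 7-bit address, which is the value most I2C driver APIs expect.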

4.7.13 Nonvolatile Memory 

The MPC8240 will be equipped with 8Kx8 of non-volatile RAM for the storage of fault record data and configuration
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the local bus
interface. The device's data bus is isolated from the local bus through an IDT IDTQS32244SO buffer. This buffer
provides 3.3V to 5V translation.

4.7.14 Fault Status and Control Registers 

The MPC8240 has access to five 8-bit status registers. One register represents its own status while the others represent
the fault status of the four G4 CPUs. Each register has the identical format, as shown in Table 8.
These five registers grant the MPC8240 status information from each node on the board, without going through the
RACEway fabric.



The MPC8240 will have one 8-bit Fault control register. The control register for each CPU will have the following 
format as shown in Table 9: 



Bit   Name                    Description
0     CHECKSTOP_OUT           Checkstop state of CPU (0 = CPU in checkstop)
1     WDM_FAULT               WDM failed (0 = WDM failed, set high after reset and valid service)
2     SOFTWARE_FAULT          Software fault detected (set to 0 when a software exception was detected) (R/W local)
3     RESETREQ_IN             Wrap status of the local CPU's reset request
4     WDM_INIT                WDM failed in initial 2 second window (0 = WDM failed)
5     Software definable 0    Software definable 0
6     Software definable 1    Software definable 1
7     unused                  unused

Table 8. Fault Status Register Format 
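
Decoding a fault status byte according to Table 8 could be sketched as follows. The fault flags are active low, as the table states. Which end of the byte "bit 0" refers to is not stated here, so the sketch takes it as a parameter and defaults to PowerPC numbering (bit 0 = most significant bit); that default, and all names, are assumptions.

```python
FIELDS = [
    "CHECKSTOP_OUT", "WDM_FAULT", "SOFTWARE_FAULT", "RESETREQ_IN",
    "WDM_INIT", "SW_DEFINABLE_0", "SW_DEFINABLE_1", "UNUSED",
]
ACTIVE_LOW_FAULTS = ("CHECKSTOP_OUT", "WDM_FAULT", "SOFTWARE_FAULT", "WDM_INIT")


def decode_status(value: int, msb0: bool = True) -> dict:
    """Map each Table 8 field name to its bit value (0 or 1)."""
    bits = {}
    for n, name in enumerate(FIELDS):
        shift = 7 - n if msb0 else n
        bits[name] = (value >> shift) & 1
    return bits


def active_faults(value: int, msb0: bool = True) -> list:
    """List the fault fields that read 0, i.e. are currently asserted."""
    bits = decode_status(value, msb0)
    return [name for name in ACTIVE_LOW_FAULTS if bits[name] == 0]


print(active_faults(0xBF))
```

A register reading of all ones therefore means "no faults", which is the state a healthy node reaches after reset and a valid watchdog service.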



Bit   Name                    Description
0     RESETREQ_OUT_0          Request a reset event (0 => forces reset)
1     CHKSTOPOUT_0            Request that node 0 enter checkstop state (0 => request checkstop)
2     CHKSTOPOUT_1            Request that node 1 enter checkstop state (0 => request checkstop)
3     CHKSTOPOUT_2            Request that node 2 enter checkstop state (0 => request checkstop)
4     CHKSTOPOUT_3            Request that node 3 enter checkstop state (0 => request checkstop)
5     CHKSTOPOUT_8240         Request that the MPC8240 enter checkstop state (0 => request checkstop)
6     Software definable 0    Software definable 0
7     Software definable 1    Software definable 1

Table 9. Fault Control Register Definition 



4.7.15 Majority Voter 

There are two different functions controlled by majority voters. The first is local to each CPU; this voter controls the
assertion of CHECKSTOP_IN to the CPU. The second voter is centralized to the board; it will control the master reset
to the board. Both voters shall follow the same set of rules: the output will follow the majority of non-checkstopped
CPUs. A 1-on-1 or 2-on-2 condition in either voter will result in a board level reset.
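
The voting rule can be expressed compactly. This sketch is an interpretation of the paragraph above, not the PLD logic: the function name and return strings are hypothetical, any tie among live voters is treated as the 1-on-1 / 2-on-2 case, and the behavior with no live voters is an assumption.

```python
def vote(requests, checkstopped):
    """Apply the majority rule to one voter.

    requests:     list of bool, True if that CPU asserts the request.
    checkstopped: list of bool, True if that CPU is in checkstop (excluded).
    Returns "asserted", "deasserted", or "board_reset" on a tie.
    """
    live = [req for req, stopped in zip(requests, checkstopped) if not stopped]
    yes = sum(live)
    no = len(live) - yes
    if live and yes == no:
        return "board_reset"          # 1-on-1 or 2-on-2 deadlock
    return "asserted" if yes > no else "deasserted"


# Five voters: four G4 nodes plus the MPC8240, one of them checkstopped.
print(vote([True, True, False, False, False], [False, False, False, False, True]))
```

With all five CPUs healthy a tie cannot occur; ties only become possible once one or three CPUs have been checkstopped, which is why the rule names the 2-on-2 and 1-on-1 cases.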




4.7.16 Discrete I/O 

There are 16 discrete output signals directly controllable and readable by the MPC8240. The 16 discretes are divided up 
into two addressable 8-bit words. Writing to a discrete output register will cause the upper 8-bits of the data bus to be 
written to the discrete output latch. Reading a discrete output register will drive the 8-bit discrete output onto the upper 
8-bits of the MPC8240's data bus. Table 10 defines the bits in the discrete output word. 

There are 16 discrete input signals accessible by the MPC8240. Reads from the discrete input address space will latch
the state of the signals and return the latched state of the discretes to the MPC8240. Table 11 defines the bits in the
discrete input word.
Table 10. Discrete Output Words



Word 2

DH(0:7)  Signal          Description
0        ND0_FLASH_EN_1  Enable the CE ASIC's FLASH port when 1
1        ND1_FLASH_EN_1  Enable the CE ASIC's FLASH port when 1
2        ND2_FLASH_EN_1  Enable the CE ASIC's FLASH port when 1
3        ND3_FLASH_EN_1  Enable the CE ASIC's FLASH port when 1
4-7      Wrap 1          Wrap to discrete input

Word 1

DH(0:7)  Signal       Description
0        WRAP0        Wrap to discrete input
1        I2C_RESET_0  Reset the I2C serial bus when 0
2        SWLED        Software controlled LED
3        FLASHSEL4    Flash bank select address bit 4
4        FLASHSEL3    Flash bank select address bit 3
5        FLASHSEL2    Flash bank select address bit 2
6        FLASHSEL1    Flash bank select address bit 1
7        FLASHSEL0    Flash bank select address bit 0

Word 0

DH(0:7)  Signal       Description
0        C_SRESET3_0  Issue a soft reset to CPU on Node 3 when 0
1        C_PRESET3_0  Reset PCE133 ASIC Node 3 when 0
2        C_SRESET2_0  Issue a soft reset to CPU on Node 2 when 0
3        C_PRESET2_0  Reset PCE133 ASIC Node 2 when 0
4        C_SRESET1_0  Issue a soft reset to CPU on Node 1 when 0
5        C_PRESET1_0  Reset PCE133 ASIC Node 1 when 0
6        C_SRESET0_0  Issue a soft reset to CPU on Node 0 when 0
7        C_PRESET0_0  Reset PCE133 ASIC Node 0 when 0
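
As an illustration of how software might use discrete output Word 1, the following C sketch updates the flash bank select bits in a shadow copy of the word. It assumes the PowerPC bit-numbering convention, in which DH bit 0 is the most significant bit of the byte; the specification does not state this, and the names DH_BIT, OUT1_* and set_flash_bank are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: DH bit 0 is the MSB (0x80) and DH bit 7 is the LSB (0x01). */
#define DH_BIT(n)           ((uint8_t)(0x80u >> (n)))

#define OUT1_WRAP0          DH_BIT(0)
#define OUT1_I2C_RESET_0    DH_BIT(1)
#define OUT1_SWLED          DH_BIT(2)
/* DH bits 3..7 carry FLASHSEL4..FLASHSEL0, i.e. the low five bits. */
#define OUT1_FLASHSEL_MASK  0x1Fu

/* Return a new value for output Word 1 with the flash bank replaced,
 * leaving WRAP0, I2C_RESET_0 and SWLED unchanged. */
static uint8_t set_flash_bank(uint8_t word1, unsigned bank)
{
    assert(bank < 32u);   /* five select bits address 32 banks */
    return (uint8_t)((word1 & ~OUT1_FLASHSEL_MASK) | bank);
}
```

Under that assumption the bank number maps directly onto the low five bits, so selecting bank 5 while the upper three bits are set turns 0xE0 into 0xE5.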



Table 11. Discrete Input Words 



Word 1

DH(0:7)  Signal           Description
0        WRAP1            Wrap from discrete output word
1        TBD
2        V3.3_FAIL_0      Latched status of power supply since last reset
3        V2.5_FAIL_0      Latched status of power supply since last reset
4        VCORE1_FAIL_0    Latched status of power supply since last reset
5        VCORE0_FAIL_0    Latched status of power supply since last reset
6        RIOR_CNF_DONE_1  RIO/RACE++ FPGA configuration complete
7        PXB0_CNF_DONE_1  PXB++ FPGA configuration complete

Word 0

DH(0:7)  Signal      Description
0        WRAP0       Wrap from discrete output word
1        WDMSTATUS   MPC8240's watchdog monitor status (0 = failed)
2, 3     NPORESET_1  Not a power on reset when high
4
5
6, 7


4.7.17 Interrupt Controller 

The MPC8240 interfaces with an 8-input interrupt controller external to the MPC8240 itself. The interrupt inputs are
wired, through the controller, to interrupt zero of the MPC8240 external interrupt inputs. The remaining four MPC8240
interrupt inputs are unused.

The Interrupt Controller comprises the following five 8-bit registers:

Pending Register - A low bit indicates a falling edge was detected on that interrupt (read only) 

Clear Register - Setting a bit low will clear the corresponding latched interrupt (write only) 

Mask Register - Setting a bit low will mask the pending interrupt from generating an MPC8240 interrupt 

Unmasked Pending Register - A low bit indicates a pending interrupt that is not masked out 

Interrupt State Register - indicates the actual logic level of each interrupt input pin 



4.7.17.1 Interrupt Controller Operation 

Table 12 lists the interrupt input sources and their bit positions within each of the five registers. A falling edge on an
interrupt input will set the appropriate bit in the pending register low. The pending register is gated with the mask
register, and any unmasked pending interrupts will activate the interrupt output signal to the MPC8240's external
interrupt input pin. Software will then read the unmasked pending register to determine which interrupt(s) caused the
exception. Software can then clear the interrupt(s) by writing a zero to the corresponding bit in the clear register. If
multiple interrupts are pending, the software has the option of either servicing all pending interrupts at once and then
clearing the pending register, or servicing the highest priority interrupt (software priority scheme) and then clearing that
single interrupt. If more interrupts are still latched, the interrupt controller will generate a second interrupt to the
MPC8240 for software to service. This will continue until all interrupts have been serviced. An interrupt that is masked
will show up in the pending register but not in the unmasked pending register and will not generate an MPC8240
interrupt. If the mask is then cleared, that pending interrupt will flow through the unmasked pending register and
generate an MPC8240 interrupt.
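
The active-low register behavior described above can be modeled in a few lines of C. This is a software simulation only, offered as a sketch: the intc_sim structure and all function names are hypothetical, register addresses are not modeled, and bit positions here are plain byte bit indices rather than the DH numbering of Table 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated controller state. All bits are active low, as in the text. */
typedef struct {
    uint8_t pending;  /* 0 bit = falling edge latched on that input   */
    uint8_t mask;     /* 0 bit = that input is masked                 */
} intc_sim;

/* Unmasked pending register: a bit is low only when the interrupt is
 * pending (low) and not masked (mask bit high). */
static uint8_t intc_unmasked_pending(const intc_sim *c)
{
    return (uint8_t)(c->pending | (uint8_t)~c->mask);
}

/* Clear register write: each 0 bit clears the latched interrupt,
 * so the matching pending bit returns high. */
static void intc_clear(intc_sim *c, uint8_t clear_value)
{
    c->pending |= (uint8_t)~clear_value;
}

/* Interrupt output to the MPC8240 is active while any unmasked
 * interrupt is still pending. */
static int intc_output_active(const intc_sim *c)
{
    return intc_unmasked_pending(c) != 0xFF;
}

/* Service every unmasked pending interrupt, then clear it.
 * Returns the number of interrupts serviced. */
static int intc_service_all(intc_sim *c, void (*handler)(int bit))
{
    int serviced = 0;
    uint8_t up = intc_unmasked_pending(c);

    for (int bit = 0; bit < 8; bit++) {
        uint8_t m = (uint8_t)(1u << bit);
        if ((up & m) == 0) {
            if (handler != NULL)
                handler(bit);
            intc_clear(c, (uint8_t)~m);
            serviced++;
        }
    }
    return serviced;
}
```

In this model a masked interrupt stays latched in the pending register, so unmasking it later makes the output active again, matching the last two sentences of the paragraph above.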



Table 12. Interrupt Controller Inputs 



Bit  Signal        Description
0    SWFAIL_0      8240 Software Controlled Fail Discrete
1    RTC_INT_0     Real time clock event
2    NODE0_FAIL_0  WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
3    NODE1_FAIL_0  WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
4    NODE2_FAIL_0  WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
5    NODE3_FAIL_0  WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
6    PCI_INT_0     PCI interrupt
7    XB_SYS_ERR_0  XBAR internal error
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4.7.18 Configuration Jumpers 

J18-1 - J18-2, the watchdog monitor mask, when installed, will mask all watchdog failures.

J18-3 - J18-4, the serial EEPROM's write enable jumper, when installed, enables modification of the serial EEPROMs.
J18-5 - J18-6, the flash write-protect jumper, when installed, prevents modification of any flash memory location.
J18-7 - J18-8, the PXB0 use PROM jumper, when installed, will enable the PXB0's serial configuration PROM.

4.7.19 LEDs 

There are nine LEDs, visible at the top of the board. 

LD1 is a software controlled LED 
LD2 is a software controlled LED 
LD3 is the Node 0 watchdog fail LED 
LD4 is the Node 1 watchdog fail LED 
LD5 is the Node 2 watchdog fail LED 
LD6 is the Node 3 watchdog fail LED 
LD7 is the MPC8240 watchdog fail LED 
LD8 indicates the state of the board level reset 
LD9 indicates a XBAR system error. 

There are two additional LEDs for Ethernet status, located on the Ethernet connector.

The MCW-1a board requires 3.3V, 2.5V, and 1.8V. There are two 1.8V supplies; each drives the core voltage for two
CPUs. To provide power to the MCW-1a, the three voltages must have separate switching supplies, and proper power
sequencing to the device must be provided. All three voltages are converted from 5.0V. The power to the daughter card
is provided directly from the modem board.

4.7.20.1 MPC7400 Core Power Supply

There are two core voltage power supplies; each one is dedicated to two MPC7400 PPC cores. The core voltage can be
in the 1.5V to 2.2V range. This power supply is rated at 12A over the range from 1.5V to 2.2V.

A 3.3V power supply provides power to the SBSRAM core, SDRAM, SCSI, PXB++, and XBAR++, and to the PCE133
I/O. This power supply is rated at TBD Amp.

4.7.20.3 Core and I/O 2.5V Power Supply

A 2.5V power supply is used to provide power to the PCE133 and can also power the PXB++ FPGA core. The
MPC7400 processor bus can run at 2.5V signaling. The MPC7400 L2 bus can operate at 2.5V signaling. This 2.5V
power supply is rated at TBD Amp.

4.7.20.4 ASICs Power Supplies Tolerance Requirements 

SBSRAM VDD = 3.3V+0.165V/-0.165V power supply 

SBSRAM VDDQ = 3.3V+0.165V/-0.165V for 3.3V I/O or 2.5V+0.4V/-0.125V for 2.5V I/O
SDRAM VDD= 3.3V+0.3V/-0.3V power supply 
XBAR++ VDD= 3.3V+0.3V/-0.3V power supply 
PCE133 VDD= 2.5V+?V/-?V power supply 
PCE133 VDD33= 3.3V+?V/-?V power supply 
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4.7.20.5 Power Supply Voltage Sequencing 

Power sequencing is very important in multivoltage digital boards. It is necessary for long-term reliability. The right
power supply sequencing can be accomplished by using power_good and inhibit signals. To provide a fail-safe option
for the device, power should be supplied so that if the core supply fails during operation, the I/O supply is shut down as
well.

The general rule is to ramp all power supplies up and down at the same time. This is shown in Figure 5. In reality, ramp
up and down depend on multiple factors: power supply, total board capacitance that needs to be charged, power supply
load, and so on. Figure 6 shows ideal worst-case sequencing for ramp up and down that is performed by the protection
sequencing circuits shown in Figure 7. This circuit keeps the voltage difference within the required range.
The MPC7400 requires that the core supply not exceed the I/O supply by more than 0.4 volts at all times. Also, the I/O
supply must not exceed the core supply by more than 2 volts.
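
The two MPC7400 limits quoted above can be expressed as a simple check, for example when validating sampled rail voltages from a bench measurement. The following C sketch is illustrative only; the function name sequencing_ok and the constant names are hypothetical.

```c
#include <assert.h>

/* Limits quoted in the text for the MPC7400 supply relationship. */
#define MAX_CORE_OVER_IO_V  0.4   /* core may exceed I/O by at most 0.4 V */
#define MAX_IO_OVER_CORE_V  2.0   /* I/O may exceed core by at most 2 V   */

/* Returns 1 when a (core, I/O) voltage pair satisfies both limits,
 * 0 otherwise. Voltages are in volts and must be non-negative. */
static int sequencing_ok(double v_core, double v_io)
{
    assert(v_core >= 0.0 && v_io >= 0.0);
    return (v_core - v_io) <= MAX_CORE_OVER_IO_V &&
           (v_io - v_core) <= MAX_IO_OVER_CORE_V;
}
```

For example, a 1.8V core with a 2.5V I/O rail passes, while a 2.5V I/O rail with the core still at 0V fails the second limit, which is the situation the sequencing circuit is meant to prevent during ramp up.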
[Schematic not reproduced. Labels recoverable from the drawing: diodes D1 through D9 and the 1.8V_2 rail.]

Figure 7. VOLTAGE SEQUENCING CIRCUITS

A voltage of 0.7V drops across one diode.

D3 and D4 provide the ramp up voltage for the 1.8V_1 power supply as soon as the 2.5V power supply reaches 1.4V.
D7 and D8 provide the ramp up voltage for the 1.8V_2 power supply as soon as the 2.5V power supply reaches 1.4V.

[Diode designators illegible in source] the 2.5V power supply as soon as the 3.3V power supply reaches [value illegible in source].
D6 provides the ramp down for the 1.8V_1 power supply as soon as the 2.5V power supply reaches [value illegible in source].
D9 provides the ramp down for the 1.8V_2 power supply as soon as the 2.5V power supply reaches [value illegible in source].

The 3.3V power supply is connected to the VCC3P3 power plane.
The 2.5V power supply is connected to the VCC2P5 power plane.
The 1.8V_1 power supply is connected to the VCC1P8_1 power plane.
The 1.8V_2 power supply is connected to the VCC1P8_2 power plane.



A PLD is [text illegible in source] signals from the onboard supplies. It is powered up from +5V and [text illegible in
source] +3.3V, +2.5V, +1.8V_1 and +1.8V_2. This circuit monitors the power_good signals from each supply. In the case of a
[text illegible in source] in one or more supplies, the PLD will issue a restart to all [text illegible in source]
card. A latched power status signal will be available from each supply as part of the discrete input word. The
discrete shall indicate any power fault condition since the last off-board reset condition.
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5 ELECTRICAL INTERFACE 

5.1.1 Power Consumption 

Table 13. MCW-1a CN Power Consumption

Description  Qty  Total Typ. Power  Total Max. Pwr.
CE ASIC      1    1W                1.5W
SDRAM        5    3W                3.5W
SBSRAM       2    1.2W              2.5W
G4           1    8W                12W
Oscillator   1    0.1W              0.1W
PLD          1    0.15W             0.2W

TBD



Table 14. MCW-1a Power Consumption



TBD 



5.1.2 I/O 

5.1.2.1 Over-the-Top RACEway++ Interlink 

See Appendix A for the over-the-top RACEway++ interlink connector pinout.

5.1.2.2 PCI 32-Bit Modem Connector 

See Appendix B for the PCI 32-bit modem connector pinout. 

5.1.2.3 Ethernet 10/100BT 

See Appendix C for the Ethernet 10/100 BT connector pinout. 

5.1.2.4 PPC Debugger 

See Appendix D for the PPC Debugger connector pinout. 
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6 MECHANICAL 



6.1.1 Packaging 

The MCW-1 is a dual-sided PCB assembly. The board is designed to be used in a custom system. The MCW-1
PCB is TBD thick and has TBD layers.

6.1.2 Physical Constraint 

The PCB board must comply with the Motorola daughter card form factor. 

7 ENVIRONMENTAL 

7.1.1 Temperature & Air Flow

Operating temperature: TBD 
Storage temperature: TBD 

7.1.2 Humidity 
TBD 



7.1.3 Operating Altitude 
TBD 



7.1.4 Shock & Vibration 
TBD 



7.1.5 Compliance 
TBD 



7.1.6 Reliability 

TBD 



8 SWITCHES & JUMPERS 

8.1 J22 Jumper 

Provisional Hotswap switch interface for the PXB0.

J22 Ref. Des.  Jumper Function
1-2            PXB0_HS_HNDL_SW high
2-3            PXB0_HS_HNDL_SW low



8.2 J11 Jumper

RACEway clock master selection

J11 Ref. Des.  Jumper Function
1-2 (open)     MCW-1A Master
1-2 (shorted)  MCW-1A Slave
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8.3 J10 Jumper 

F1 RACEway XBREQ_I - XBREQ_O crossover.

J10 Ref. Des.  Jumper Function
3-4, 5-6       Straight through
1-2, 7-8       Crossover



8.4 J4 Jumper 

F2 RACEway XBREQ_I - XBREQ_O crossover.

J4 Ref. Des.  Jumper Function
3-4, 5-6      Straight through
1-2, 7-8      Crossover



8.5 J3 Jumper 

F2 RACEway CBL_CLK_O - CBL_CLK_I crossover.

J3 Ref. Des.  Jumper Function
3-4, 5-6      Straight through
1-2, 7-8      Crossover



8.6 J9 Jumper 

F1 RACEway CBL_CLK_O - CBL_CLK_I crossover.

J9 Ref. Des.  Jumper Function
3-4, 5-6      Straight through
1-2, 7-8      Crossover



8.7 J18 Jumper 

Miscellaneous control 



J18 Ref. Des.  Jumper Function
1-2            WDM fail disable
3-4            Serial PROM write enable
5-6            FLASH write enable
7-8            PXB0 use configuration PROM
9-10           Unused



8.8 J21 Jumper 

Master clock source selector 



J21 Ref. Des.  Jumper Function
1-2            F1 cable port master
3-4            F2 cable port master
Both closed    MCW-1A master
Both open      MCW-1A master



9 TESTABILITY 






9.1 JTAG Test Scan 

The MPC7400, MPC8240, PCI-PCI bridge, PCE133 ASIC, PXB++ ASIC, XBAR++ ASIC, and the Ethernet
controller provide support for the IEEE Standard 1149.1 test port (JTAG). Refer to the individual component
specifications to obtain their JTAG test access port (TAP) descriptions.

The MCW-1a board contains several JTAG scan chains. They provide access to the JTAG test port on the
MPC7400s, MPC8240, L2 caches, XBAR++, PCE133s, Ethernet, PCI-PCI bridge, and the PXB devices. The
scan chains are defined as:
Chain 1 -> MPC7400_1
Chain 2 -> MPC7400_2
Chain 3 -> MPC7400_3
Chain 4 -> MPC7400_4
Chain 5 -> MPC8240

Chain 6 -> RESET_PLD, PCEFIX1_PLD, NODE0_HA_PLD, NODE1_HA_PLD, PCEFIX2_PLD,
NODE2_HA_PLD, NODE3_HA_PLD, 8240_DECODE_PLD, VOTER_SYNC_PLD, 8240_HA_PLD,
PXB_PROM, L2_Cache_1, PCE133_1, L2_Cache_2, PCE133_2, XBAR, L2_Cache_3, PCE133_3,
L2_Cache_4, PCE133_4, PXB++, PCI-PCI Bridge, Ethernet

The scan path is accessible via connector J16. The enable for the scan chain buffer is controlled by jumper
J20.

The RACEway++ interlink external connectors will be tested with external loop-back connectors. 

Note: Both the RACEway++ clock (66 MHz) and the PCI clock (33 MHz) must be running to allow the scan path in 
the PXB to function properly. 



10 RACEway++ Over-the-Top Connector Pinout 
Table 15. RACEway++ F1 Cable Mode Connector Pinout J-1

Pin  Signal     Pin  Signal
A1   GND        B1   CLK_XJX1_IO
A2   GND        B2   JX1_CBL_CLK_IO
A3   GND        B3   JX1_XBREQ_I
A4   GND        B4   JX1_XBREQ_O
A5   GND        B5   JX1_XBSTROBIO
A6   GND        B6   JX1_XBRPLYIO
A7   GND        B7   JX1_XBRDCONIO
A8   GND        B8   JX1_XBIO00
A9   GND        B9   JX1_XBIO01
A10  GND        B10  JX1_XBIO02
A11  GND        B11  JX1_XBIO03
A12  GND        B12  JX1_XBIO04
A13  GND        B13  JX1_XBIO05
A14  GND        B14  JX1_XBIO06
A15  GND        B15  JX1_XBIO07
A16  GND        B16  JX1_XBIO08
A17  GND        B17  JX1_XBIO09
A18  GND        B18  JX1_XBIO10
A19  GND        B19  JX1_XBIO11
A20  GND        B20  JX1_XBIO12
A21  GND        B21  JX1_XBIO13
A22  GND        B22  JX1_XBIO14
A23  GND        B23  JX1_XBIO15
A24  GND        B24  JX1_XBIO16
A25  GND        B25  JX1_XBIO17
A26  GND        B26  JX1_XBIO18
A27  GND        B27  JX1_XBIO19
A28  GND        B28  JX1_XBIO20
A29  GND        B29  JX1_XBIO21
A30  GND        B30  JX1_XBIO22
A31  GND        B31  JX1_XBIO23
A32  GND        B32  JX1_XBIO24
A33  GND        B33  JX1_XBIO25
A34  GND        B34  JX1_XBIO26
A35  GND        B35  JX1_XBIO27
A36  GND        B36  JX1_XBIO28
A37  GND        B37  JX1_XBIO29
A38  GND        B38  JX1_XBIO30
A39  JX1_XBPAR  B39  JX1_XBIO31
A40  +3.3V      B40  R_RST_JX


Table 16. RACEway++ F2 Cable Mode Connector Pinout J-2 



Pin  Signal     Pin  Signal
A1   GND        B1   CLK_XJX2_IO
A2   GND        B2   JX2_CBL_CLK_IO
A3   GND        B3   JX2_XBREQ_I
A4   GND        B4   JX2_XBREQ_O
A5   GND        B5   JX2_XBSTROBIO
A6   GND        B6   JX2_XBRPLYIO
A7   GND        B7   JX2_XBRDCONIO
A8   GND        B8   JX2_XBIO00
A9   GND        B9   JX2_XBIO01
A10  GND        B10  JX2_XBIO02
A11  GND        B11  JX2_XBIO03
A12  GND        B12  JX2_XBIO04
A13  GND        B13  JX2_XBIO05
A14  GND        B14  JX2_XBIO06
A15  GND        B15  JX2_XBIO07
A16  GND        B16  JX2_XBIO08
A17  GND        B17  JX2_XBIO09
A18  GND        B18  JX2_XBIO10
A19  GND        B19  JX2_XBIO11
A20  GND        B20  JX2_XBIO12
A21  GND        B21  JX2_XBIO13
A22  GND        B22  JX2_XBIO14
A23  GND        B23  JX2_XBIO15
A24  GND        B24  JX2_XBIO16
A25  GND        B25  JX2_XBIO17
A26  GND        B26  JX2_XBIO18
A27  GND        B27  JX2_XBIO19
A28  GND        B28  JX2_XBIO20
A29  GND        B29  JX2_XBIO21
A30  GND        B30  JX2_XBIO22
A31  GND        B31  JX2_XBIO23
A32  GND        B32  JX2_XBIO24
A33  GND        B33  JX2_XBIO25
A34  GND        B34  JX2_XBIO26
A35  GND        B35  JX2_XBIO27
A36  GND        B36  JX2_XBIO28
A37  GND        B37  JX2_XBIO29
A38  GND        B38  JX2_XBIO30
A39  JX2_XBPAR  B39  JX2_XBIO31
A40  +3.3V      B40  R_RST_JX
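
Both cable connectors follow the same regular layout, with data bit XBIO(n-8) on pin B(n) for pins B8 through B39. The following C sketch captures that mapping, for instance for a loop-back test script. The helper names are hypothetical, and the signal name format is only inferred from the tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns the XBIO data bit index carried on pin B<pin>,
 * or -1 when that B pin is not a data pin. */
static int xbio_index_for_b_pin(int pin)
{
    return (pin >= 8 && pin <= 39) ? pin - 8 : -1;
}

/* Writes the signal name for data pin B<pin> of connector JX<connector>
 * into buf and returns the bit index, or -1 for a non-data pin. */
static int b_pin_signal_name(int connector, int pin, char *buf, size_t len)
{
    int bit = xbio_index_for_b_pin(pin);

    assert(buf != NULL && len > 0);
    if (bit < 0)
        return -1;
    snprintf(buf, len, "JX%d_XBIO%02d", connector, bit);
    return bit;
}
```

Calling b_pin_signal_name(1, 18, ...) yields bit index 10 and the name JX1_XBIO10, which agrees with the B18 row of Table 15.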
11 Modem Board Connector Pinout

Table 17. Modem Board Connector Pin Assignments

J29

Pin  Signal       Signal        Pin
1    5V           PMC_AD0       2
3    5V           PMC_AD1       4
5    5V           PMC_AD2       6
7    5V           PMC_AD3       8
9    PCI_RST_0    PMC_AD4       10
11   GND          PMC_AD5       12
13   GND          PMC_AD6       14
15   PMC_IDSEL_1  PMC_AD7       16
17   5V           PMC_AD8       18
19   5V           PMC_AD9       20
21   PMC_TRDY_0   PMC_AD10      22
23   GND          PMC_AD11      24
25   GND          PMC_AD12      26
27   PMC_STOP_0   PMC_AD13      28
29   5V           PMC_AD14      30
31   5V           PMC_AD15      32
33   PMC_PERR_0   PMC_AD16      34
35   GND          PMC_AD17      36
37   GND          PMC_AD18      38
39   PMC_SERR_0   PMC_AD19      40
41   5V           PMC_AD20      42
43   5V           PMC_AD21      44
45   CLK_PMC      PMC_AD22      46
47   GND          PMC_AD23      48
49   GND          PMC_AD24      50
51   PMC_C_BE0    PMC_AD25      52
53   PMC_C_BE1    PMC_AD26      54
55   5V           PMC_AD27      56
57   5V           PMC_AD28      58
59   PMC_C_BE2    PMC_AD29      60
61   PMC_C_BE3    PMC_AD30      62
63   GND          PMC_AD31      64
65   GND          5V            66
67   GND          PMC_FRAME_0   68
69   PMC_INTA_0   GND           70
71   GND          PMC_IRDY_0    72
73   GND          5V            74
75   PMC_GNT_0    PMC_DEVSEL_0  76
77   5V           PMC_LOCK_0    78
79   PMC_REQ_0    PMC_PAR       80
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12 Processor JTAG Connector Pinout . 

The JTAG connectors are unique to each processor. Table 18 shows the generic signal names on each connector pin; the
actual names will have each processor's extension appended to the generic signal name.
Table 18. JTAG Jx Connectors Pin Assignments 



Jx-  SIGNAL       Jx-  SIGNAL
1    TDO          2    QACKN
3    TDI          4    TRSTN
5    HALTEDN      6    3.3V
7    TCK          8    CKSTOP_INN
9    TMS          10   N.C.
11   SRESETN      12   N.C.
13   HRESETN      14   «key»
15   CKSTOP_OUTN  16   GND
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13 Non-Processor JTAG Connector Pinout 

The non-processor JTAG connector ties together all the remaining JTAG capable devices. Table 19 shows the
signal names on each connector pin. The connector is designed to include only the programmable PLDs and PROM
when the program cable is installed, or the entire chain when the boundary scan test connector is installed.
Table 19. JTAG J16 Connector Pin Assignments



J16-  Signal       Description
1     TMS_JTAG     JTAG Test Mode Select
2     TDI_JTAG     JTAG Test Data In
3     TDO_JTAG     Boundary Scan Test Data Out
4     TESTN        Driven low when connector inserted
5     TCK_JTAG     JTAG Test Clock
6     GND          Ground on module
7     PXB_CNF_TDO  TDO from end of PLD chain
8     TDI_ND0      TDI into non-PLD chain
9     +5V          +5V Power on Module
10    TEST         Driven high when connector inserted



PLD Program Configuration

TMS    J16-1  TMS_JTAG
TDI    J16-2  TDI_JTAG
TCK    J16-5  TCK_JTAG
TDO    J16-7  PXB_CNF_TDO
Power  J16-9
TESTN, GND

Boundary Scan Test Configuration

TMS    J16-1  TMS_JTAG
TDI    J16-2  TDI_JTAG
TCK    J16-5  TCK_JTAG
       J16-7  PXB_CNF_TDO
       J16-8  TDI_ND0
TDO    J16-3  TDO_JTAG
Power  J16-9
TESTN, GND



Figure 8. JTAG CONNECTOR CONFIGURATION OPTIONS 
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14 Design Notes 

14.1 MPC7400 and Nitro Bus Signaling Voltage Support 





               1.8V  2.5V  3.3V
MPC7400 V60x   Yes   Yes   Yes
MPC7400 VL2    Yes   Yes   Yes
Nitro V60x     Yes   Yes   No
Nitro VL2      Yes   Yes   No
PCE133 V60x    No    Yes   No
SBSRAM Vi/o    No    Yes   Yes



14.2 Bypass Capacitors Selection 

(Based on Micron application note TN-00-06)

Vcore = 3.3V +/- 0.165V, which is 5%
Vi/o = 2.5V +/- 0.125V, which is 5%

When the SBSRAMs are driving a 21 pF load from 0V to 2.5V with 1 ns edges, the transient current is:
I = (C * dV)/dt = (30 pF * 2.5V)/1 ns = 75 mA per I/O pin.
For 36 I/O pins, 36 * 75 mA = 2.7A in a 1 ns time interval.

The SyncBurst SRAM has a VDD tolerance of 3.3V +/- 0.165V. Considering some droop from the power bus and a switching
time of 1 ns, and allowing a maximum voltage dip (dV) on the SRAM of -0.05V, the choice of bypass capacitor becomes:

C = (I * dt)/dV = (2.7A * 1 ns)/0.05V = 54 nF per SBSRAM.

Choosing 6 x 10 nF allows some margin.

It is better to use reverse ratio capacitors 0508, 0406, or 0204.

Low ESR is also very important.

Use a temperature stable dielectric such as X7R.

From Vishay: VJ0402 style, X7R.
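
The arithmetic in this section can be reproduced with two small C helpers, which makes it easy to rerun the estimate for a different load or allowed dip. This is a sketch for illustration; the function names are hypothetical, and the 30 pF load used in the usage note is the value that appears in the equation above.

```c
#include <assert.h>

/* Transient current while charging a load capacitance:
 * I = C * dV / dt  (farads, volts, seconds -> amperes). */
static double transient_current(double c_load_f, double dv_v, double dt_s)
{
    assert(dt_s > 0.0);
    return c_load_f * dv_v / dt_s;
}

/* Bypass capacitance needed to supply current I for time dt with a
 * supply dip no larger than dV: C = I * dt / dV. */
static double bypass_capacitance(double i_a, double dt_s, double dv_v)
{
    assert(dv_v > 0.0);
    return i_a * dt_s / dv_v;
}
```

With a 30 pF load, a 2.5V swing and a 1 ns edge, transient_current gives 75 mA per pin; multiplying by 36 pins and passing the 2.7A result to bypass_capacitance with a 1 ns interval and a 0.05V dip gives 54 nF, matching the figures in the text.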



14.3 Tantalum Capacitors Selection 

Ultra-low ESR T510 tantalum capacitors are used in the switching power supply, in addition to several bulk storage capacitors
distributed around the PCB that feed the Vcore and Vi/o planes, to enable quick recharging of the bypass chip capacitors.
The number of bulk-storage tantalum capacitors depends on the power supply response time characteristic.

The MPC7400 can go from nap mode to full-on mode power within two cycles.

Icore = (10W - 2W) / 1.8V ≈ 4.5A
dt = 10 µs

C = (I * dt)/dV = (4.5A * 10 µs) / 0.05V = 900 µF
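
The bulk capacitance estimate starts from the step in core power rather than from a pin load, which the following C sketch makes explicit. The function names are hypothetical. Note that (10W - 2W)/1.8V evaluates to about 4.44A; the text rounds this up to 4.5A before computing the 900 µF result.

```c
#include <assert.h>

/* Step in core current when the processor goes from nap power to
 * full-on power at a fixed core voltage: I = (Pfull - Pnap) / Vcore. */
static double core_current_step(double p_full_w, double p_nap_w, double v_core)
{
    assert(v_core > 0.0 && p_full_w >= p_nap_w);
    return (p_full_w - p_nap_w) / v_core;
}

/* Bulk capacitance that holds the rail within dV while supplying the
 * current step for the supply response time dt: C = I * dt / dV. */
static double bulk_capacitance(double i_step_a, double dt_s, double dv_v)
{
    assert(dv_v > 0.0);
    return i_step_a * dt_s / dv_v;
}
```

Using the rounded 4.5A step with a 10 µs response time and a 0.05V dip reproduces the 900 µF figure; using the unrounded step gives a slightly smaller value, so the rounding errs on the safe side.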
1.2. Red-Box

2. Transform Object Sample
2.1. Include the following files to define the interface, and variables required
2.1.1. The contents of dx_dma_var.h:
2.2. Initialize the interface
2.3. Receive input
2.3.1. An Example of the receiving of data from input pin 0
2.4. Send Output
2.4.1. An Example of the sending data on output pin 0
3. Transforms for WCDM Simulation:
3.1. handset (one of n):
3.1.1. input pins:
3.1.2. Output pins:
3.2. Chan (set of one to m objects):
3.2.1. Input pins:
3.2.2. Output pins:
3.3. broadcast (set of one to k objects):
3.3.1. Input pins:
3.3.2. Output pins:
3.4. Rake (one of n):
3.4.1. Input pins:
3.4.2. Output pins:
3.5. MUX (set of one to L objects):
3.5.1. Input pins:
3.5.2. Output pins:
3.6. MUD (one object for now):
3.6.1. Input pins:
3.6.2. Output pins:
3.7. BER (set of one to m objects):
3.7.1. Input pins:
3.7.2. Output pins:
1. Introduction

[Text illegible in source] prototype framework and how to use it. The purpose of this [text illegible in source].

The above figure [text illegible in source] application that is managed by the [text illegible in source].
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WCG Framework 



[Figure not reproduced. Recoverable labels: DMA transfer, Red Box.]




1.1. Transform Object

The transform object is the basic building block and can be, for example, a Turbo-coder, a QAM modulator, etc.

1.2. Red-Box 

The red-box collects transform objects into a logical grouping that describes all of the processing
that will be carried out on a single CPU. (Note: for non-real-time operation, e.g. simulation,
collections of red-boxes can be on a single CPU.)
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2. Transform Object Sample 

2.1. Include the following files to define the interface, and variables
required.

#include "mc_error.h"
#include "mcw1.h"
#include "dx_dma.h"
#include "dx_dma_var.h"

2.1.1. The contents of dx_dma_var.h:

int my_logical_ce;
CONFIG_data *ptr_config_base;
CONFIG_data *ptr_cur_config;
CONFIG_data *ptr_tmp_config;

int active_in_ce[(MAX_CE+1) * MAX_CHAN];
int active_in_ch[(MAX_CE+1) * MAX_CHAN];
int active_in_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_in_buf[(MAX_CE+1) * MAX_CHAN];
int active_in_index;

int active_out_ce[(MAX_CE+1) * MAX_CHAN];
int active_out_ch[(MAX_CE+1) * MAX_CHAN];
int active_out_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_out_buf[(MAX_CE+1) * MAX_CHAN];
int active_out_index;

#define dma_send_pin(pin) \
    dma_send( \
        my_logical_ce, \
        active_out_ce[pin], \
        active_out_ch[pin], \
        (char **)&active_out_buf[pin] \
    )

#define dma_rec_pin(pin) \
    dma_rec( \
        active_in_ce[pin], \
        my_logical_ce, \
        active_in_ch[pin], \
        (char **)&active_in_buf[pin] \
    )

2.2. Initialize the interface 

// get config SMB

dma_all_init(
    my_logical_ce,
    active_in_ce,
    active_in_ch,
    active_in_buf_size,






MEMO AF-4 
Prototype Framework V0. 1 



    active_in_buf,
    (int *)&active_in_index,
    active_out_ce,
    active_out_ch,
    active_out_buf_size,
    active_out_buf,
    (int *)&active_out_index,
    (CONFIG_data **)&ptr_config_base
);

ptr_cur_config = &ptr_config_base[my_logical_ce];
#ifdef debug_print
printf("Vir CE %i, module name is %s\n",
       my_logical_ce, ptr_cur_config->module_name);
#endif

ptr_cur_config->state = STATE_RDY; /* all init done now ready */

// wait for rx to be ready
ptr_tmp_config = &ptr_config_base[active_out_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need receiver to be ready
    sched_yield();

// wait for tx to be ready
ptr_tmp_config = &ptr_config_base[active_in_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need transmitter to be ready
    sched_yield();

#ifdef debug_print
printf("\nCE %i, Virtual CE %i, Starting\n", (int)ce_getid(), my_logical_ce);
#endif

2.3. Receive input 

Receive input data if required, input pins can be left unused. 

2.3.1. An Example of the receiving of data from input pin 0 

/* get data from other CE */
rc = dma_rec_pin(0);
ERROR_MCW1(rc);

OR

rc = dma_rec(
    active_in_ce[0],
    my_logical_ce,
    active_in_ch[0],
    (char **)&active_in_buf[0]
);
ERROR_MCW1(rc);
The data is available in the active_in_buf pointer. Note that this always points to the next available
input buffer in the case of multi-buffering. At a later date the size of the input chunk and its offset will be
provided so that a FIFO-like structure can be used.

2.4. Send Output 

Send output data if required, output pins can be left unused. 

2.4.1. An Example of the sending data on output pin 0 

/* send data to other CE */
rc = dma_send_pin(0);
ERROR_MCW1(rc);

OR

rc = (long)dma_send(
    my_logical_ce,
    active_out_ce[0],
    active_out_ch[0],
    (char **)&active_out_buf[0]
);
ERROR_MCW1(rc);

The data in the active_out_buf pointer will be sent. On return, this pointer always points to the next
available output buffer in the case of multi-buffering. At a later date the size of the output chunk and its
offset will be provided so that a FIFO-like structure can be used.
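
The reason the send and receive calls take a char ** rather than a char * is that they repoint the caller's buffer pointer at the next free buffer. The following C sketch models that behavior with a hypothetical stand-in, mock_dma_send, and a two-buffer pool. It is not the framework's dma_send; the pool layout, the buffer count and the stale-pointer check are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BUFS 2    /* assumption: double buffering */
#define BUF_SIZE 64   /* assumption: arbitrary buffer size */

typedef struct {
    char storage[NUM_BUFS][BUF_SIZE];
    int next;         /* index of the buffer currently owned by the caller */
} pin_buffers;

/* Hypothetical stand-in for dma_send(): "sends" the buffer at *buf, then
 * repoints *buf at the next available buffer. Returns 0 on success and
 * -1 if the caller passes a pointer it does not currently own. */
static long mock_dma_send(pin_buffers *pb, char **buf)
{
    assert(pb != NULL && buf != NULL);
    if (*buf != pb->storage[pb->next])
        return -1;
    pb->next = (pb->next + 1) % NUM_BUFS;
    *buf = pb->storage[pb->next];
    return 0;
}
```

A transform object written against this model fills the buffer at its output pointer, calls the send routine, and then simply keeps using the same pointer variable, which now refers to a different buffer.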
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3. Transforms for WCDM Simulation: 
3.1. handset (one of n):

This object has two input pins and one output pin. It performs the following:

1. Generate transport channel

2. MUX and channel coding

3. Generate TX waveform

4. Simulate RX system for power control etc.

5. Outputs to the chan model

3.1.1. Input pins:

3.1.1.1. power_control pin 0:

Input to this pin is from output pin 0 of the rake block and is the slot power control.

3.1.1.2. next_chunk pin 1:

Input to this pin is from output pin 1 of the BER block and is the request to send the next n symbols for
processing, e.g. 2 symbols, or a slot etc.

3.1.1.3. next_chunk pin 1:

Optional input pin, used to provide access to the raw data input from outside of the "Generate traffic channel
bits" step; i.e. if we did a codec, the output of the codec would go into this block.

3.1.2. Output pins: 

3.1.2.1. signal_out pin 0:

This pin goes to one input pin of the chan object group.

3.1.2.2. raw_bits pin 1:

This pin has the raw data bits as encoded into the Data channel so that the BER block
calculations can be done.

3.2. Chan (set of one to m objects): 

In this group of objects, each has two to n input pins and one output pin. They collectively
perform the following:

1. Channel model for each of the inputs except the carry pin

2. Sums the local signals, and adds the carry input pin

3. Outputs to the front_end object to send same data to all rake inputs

3.2.1. Input pins:

3.2.1.1. sum_in pin 0:
Input to this pin is from output pin 0 of another channel object. Currently a dummy input is required
on this pin for the process to fire (needs more thought, i.e. a special first chan??).

3.2.1.2. signal_in pin 1 to n:

Input to this pin is from output pin 0 of the handset block.
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3.2.2. Output pins: 

3.2.2.1. signal_out pin 0:

This pin goes to input pin 0 of the broadcast object. 

3.3. front_end (one object):

This object has one input pin and one output pin. It performs the following:

1. Adds the multiple antenna [text illegible in source] receiver distortions and noise

2. Simulate RX system (AGC, A/D, multiple antennas) etc.

3. Outputs to the broadcast object to send same data to all rake inputs

Multiple antennas should be treated as separate data streams. The rake receiver will process them
independently, until the MRC stage.

3.3.1. Input pins:
3.3.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the last channel object.

3.3.2. Output pins:

3.3.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the broadcast objects.
3.4. broadcast (set of one to k objects): 

This object is required to simulate broadcast. Until the simple framework supports this feature, we
need this object.

Each object in the group has one input pin and one to n output pins. They collectively perform
the following:

1. Takes one input and copies it to all of the output pins unmodified

2. Outputs same data to all rake input 0 pins.

3.4.1. Input pins:
3.4.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the front_end object.

3.4.2. Output pins:

3.4.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the rake objects.
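
The core of the broadcast object is a plain fan-out copy. The following C sketch shows that step in isolation, under the assumption that each output pin has its own buffer of at least the input length; the function name broadcast_copy is hypothetical, and the DMA calls that would move the buffers between objects are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies one input buffer, unmodified, into each of n_out output
 * buffers. Every output buffer must hold at least len bytes. */
static void broadcast_copy(const char *in, size_t len,
                           char *const outs[], size_t n_out)
{
    assert(in != NULL && outs != NULL);
    for (size_t i = 0; i < n_out; i++) {
        assert(outs[i] != NULL);
        memcpy(outs[i], in, len);
    }
}
```

In the framework, the input would come from the buffer for input pin 0 and each destination would be the buffer for one output pin, followed by one send per pin.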

3.5. Rake (one of n): 

This object has one input pin and two output pins. It performs the: 

1. AGC, AFC

2. Initial signal acquisition and searcher receiver(s)

3. Multiple finger receivers

4. Channel estimation, MRC etc.

5. Final data channel despreading.
6. Outputs to:

• MUD group of processes

• Soft-decision symbol processing (FEC decoding, demultiplexing) (25.212)



3.5.1. Input pins:

3.5.1.1. signal_in pin 0:

Input to this pin is from the broadcast set, and carries the signals of all the handsets, noise, etc.

3.5.2. Output pins: 

3.5.2.1. power_control pin 0 : 

This is the slot power control to be sent back to the handset. 

3.5.2.2. signal_out pin 1:

This pin goes to one input pin of the MUX object group. 

3.6. MUX (set of one to L objects):

This object is required to gather and package information from the 1 to n rake objects. The inputs
are placed into packets (???) or into arrays (???), To Be Determined (TBD). This object should be
morphed into the best approximation of the packaging to be provided by a targeted modem.

Each object in the group has one to n input pins and one output pin. They collectively perform
the following:

1. Package rake information into simulated modem-sourced data.

2. Outputs to MUD input 0 pin (for now, until MUD integration, there will be a dummy
placeholder block).

3.6.1. Input pins: 

Input is from pin 1 of a rake object, or from another MUX object's output pin 0.

3.6.2. Output pins:

3.6.2.1. signal_out pin 0:

This pin goes to input pin 0 of the MUD object, or to an input pin of another MUX object.

3.7. MUD (one object for now): 

This object is a placeholder until a real MUD is implemented.
MUD has one input pin and one output pin.

1. Passes through data and formats it for the BER block.

2. Outputs to BER input 0 pin.

3.7.1. Input pins:

3.7.1.1. signal_in pin 0 :

Input to this pin is from output pin 0 of the MUX object. 

3.7.2. Output pins: 

3.7.2.1. signal_out pin 0 : 

This pin goes to input pin 0 of the BER object. 

3.8. BER (set of one to m objects): 

This object is required to gather and package information from the 1 to n handset objects and the 
MUD. The inputs are placed into packets(???) or into arrays (???) To Be Determined (TBD). 
This object should be morphed into the best approximation of the packaging to be required by a 
targeted modem. It also compares the raw input data and raw received data. It also does the FEC 
detection and correction and Block error rate. 

Each object in the group has one to n input pins and one to n+1 output pins. They collectively 
perform the: 

1. Package rake/MUD information into simulated modem destination data.

2. Perform all of the bit-level processing, interleaving, FEC. This should be in a separate block.

3. BER, BLER, etc. BLER should be done via the CRC check, after all symbol decoding is
performed.

4. Outputs to GUI input 0 pin to display the stats.

5. Outputs the generate-next-slot command to the one to n handsets.

3.8.1. Input pins:

3.8.1.1. signal_in pin 0 to m:

Input to this pin is from output pin 0 {for now until MUD integrated} of the MUD object, or 
another output pin 0 of a BER object. 

3.8.2. Output pins: 

3.8.2.1. stats_out pin 0 :

This pin goes to input pin 0 of the host object for display of data on the GUI. 

3.8.2.2. next_slot pin 1 (one of n): 

This pin goes to input pin 1 of the handset object to indicate the system is ready for the next slot 
of data.

From: Jon Greene <greene@mc.com>

To: "Lauginiger, Frank" <fpl@mc.com>, <joates@mc.com>, <afuchs@mc.com>, 

<mvinskus@mc.com> 

Date: 6/23/00 3:05PM 

Subject: Some MUD analysis 

All: 

Obviously, I've been thinking about MUD a lot. Below is some analysis. 

First some news. We apparently have 400 MHz, 2 meg / 266 MHz L2 Nitros in
house (samples). Vitaly is presently working to bring them up. This is
excellent news. Besides the above speed/size properties, Nitros use
significantly lower power than Max's and allow for varying L2 configuration
options. Nitro L2's can be configured the normal way (as a cache) or all or
half (1 meg) as SRAM memory and can be addressed as such directly. For
example one can write a buffer into this memory with vmov or, better yet,
as the output of some computation. I'm not sure if it could be the source or
target of a RACEway xfer but we should try to find this out. Even if
configured as a coherent cache, it can be easily locked and unlocked in user
mode. I think configuring as 2 meg of SRAM may work the best for MUD but we
should determine this empirically.

Now a critical analysis of ops, buffer sizes, bandwidth, access patterns,
algorithm structure and phases of the moon, are all essential to arriving at
a strategy that stands a chance of working. This of course is not easy
because various techniques impact all of the above in unequal ways. Let's
just consider the R1/R1m R-matrix processing on the above Nitro with a
maximum of 100 users. *Without* taking advantage of the diagonal symmetry in
the Corr matrix, which I now believe will be very difficult to do in the
R-matrix ucoded processing loop(s) (we should discuss this), but still
assuming Corr *can* effectively exist as a byte matrix without degrading
accuracy beyond acceptability, a single plane (i.e., a processor's worth) of
the Corr matrix requires 200 * 200 * 32 = 1,280,000 bytes which fits, albeit
uncomfortably, into the L2. At 2 gigabytes/sec (~ 266 * 8), this matrix (if
L2 resident) can theoretically be consumed in 0.64 ms (remember, 1.33 ms is
our budget). Now, *if* we go with a completely separate X matrix calculation
without stripmining *and* we also store it as byte values, it would require
at most 100 * 100 * 32 = 320,000 bytes. This must be entirely produced and
consumed in the 1.33 ms time slice. In *theory*, this can be done in 0.32
ms. Finally the R1_temp output is of size 200 * 200 = 40,000 bytes and can
be produced in .02 ms. So, with the fully separate X matrix approach and no
symmetry in the Corr, we theoretically require ~1,750,000 bytes of buffer
size (I added a little more for stray stuff such as the C vectors and the
phys <=> virt LUTs, etc.) and ~1.0 ms to produce and consume these buffers.
If we stripmined X, which seems a better way to go, we could hopefully keep
it resident in L1, thereby reducing L2 buffers to ~1,350,000 bytes and 0.7
ms of L2 I/O. The stripmining also allows us the option of keeping the X
strip as shorts rather than bytes.
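
For anyone who wants to replay the buffer arithmetic above, here is a rough sketch. It assumes the rounded 2 gigabytes/sec L2 rate, and it assumes the 0.32 ms figure for X means one pass to produce it plus one pass to consume it; the variable names are made up for illustration.

```python
L2_BYTES_PER_SEC = 2.0e9   # rounded L2 rate quoted above (~266 MHz * 8 bytes)
SLOT_BUDGET_MS = 1.33      # time slice quoted above


def transfer_ms(num_bytes, passes=1):
    """Time in ms to stream num_bytes through the L2 `passes` times."""
    return passes * num_bytes / L2_BYTES_PER_SEC * 1e3


corr_bytes = 200 * 200 * 32      # one plane of Corr stored as bytes
x_bytes = 100 * 100 * 32         # separate X matrix stored as bytes
r1_temp_bytes = 200 * 200        # R1_temp output

corr_ms = transfer_ms(corr_bytes)          # consumed once
x_ms = transfer_ms(x_bytes, passes=2)      # assumed: produced, then consumed
r1_temp_ms = transfer_ms(r1_temp_bytes)    # produced once

total_bytes = corr_bytes + x_bytes + r1_temp_bytes
total_ms = corr_ms + x_ms + r1_temp_ms
fits_budget = total_ms < SLOT_BUDGET_MS
```

The totals come out slightly under the ~1,750,000 bytes and ~1.0 ms quoted above because the sketch leaves out the "stray stuff" allowance.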

Now let's consider the ops count. For the R1/R1m processing (including the
generation of the X matrix and 2 antennas), I come up with (2 * 6 * 100 *
100 * 16 + 4 * 200 * 200 * 16) * 750 = (1,920,000 + 2,560,000) * 750 = 3.36
GOPS. (BTW, if you were wondering, 750 = 1000/1.33.) The R0 processing has
fewer GOPS due to the symmetry. I get (1,920,000 + 2,560,000/2) * 750 = 2.40
GOPS. Since the R0 and R1/R1m processing use the same X matrix, we may be
tempted to consider having only the R0 processor compute the X matrix and
ship it to the R1/R1m processor. This looks nice from a GOPS perspective (R0
= 2.40, R1/R1m = 1.92) but I'm not sure it will work very well given the
lockstep nature of the processing pipe. For example, will the R1/R1m
processor simply be idle waiting for the X matrix or will it be completing
the *prior* R1_temp processing while the R0 processor is computing the
current X?

But the real killer about having R0 ship X to R1/R1m is that the X matrix
(320,000 bytes) will take at least 1.23 ms over RACE++
(320,000/260,000,000). And let's not forget the 40,000 byte R_temp output
matrix that has to also be shipped out in the same time frame. So I don't
think this OPs balancing approach will work.
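
The GOPS and RACE++ numbers above can be replayed with a few lines. This is just a restatement of the arithmetic in this note; the split into an "X" term and an "R" term follows the way the R0 and R1/R1m figures are quoted, and the variable names are hypothetical.

```python
SLOTS_PER_SEC = 750                  # rounding of 1000 / 1.33 used above

x_ops = 2 * 6 * 100 * 100 * 16       # X-matrix generation, 2 antennas
r_ops = 4 * 200 * 200 * 16           # R-matrix processing, no symmetry

r1_gops = (x_ops + r_ops) * SLOTS_PER_SEC / 1e9           # R1/R1m incl. X
r0_gops = (x_ops + r_ops // 2) * SLOTS_PER_SEC / 1e9      # R0 with symmetry
r1_without_x_gops = r_ops * SLOTS_PER_SEC / 1e9           # X shipped from R0

RACEPP_BYTES_PER_SEC = 260e6         # RACE++ rate used above
x_transfer_ms = 320_000 / RACEPP_BYTES_PER_SEC * 1e3
leaves_room_in_slot = x_transfer_ms < 1.33 - 0.5          # arbitrary 0.5 ms headroom
```

With these inputs the X transfer alone takes about 1.23 ms of the 1.33 ms slot, which is the point made above.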

We therefore appear to require 3.36 GOPS out of R1/R1m and we might just not
even bother with the R0 symmetry since it doesn't buy you very much given 
that mpic needs both R0 and R1/R1m as inputs. In other words, have both 
R-matrix processors run essentially the same code. (Will this work?) 

Now 3.36 GOPS out of one processor is a tall order. We may have to resort to 
a more asymmetric division of labor (The R0 processor takes advantage of the 
R0 symmetry and also does a portion of R1/R1m). But, I'd like to pursue the
more balanced division until we are absolutely sure it won't work. 

In this approach, both the R0 and R1/R1m processors independently produce
and consume X in strips. A variant could instead produce and consume a 
single "value" (actually 32 shorts) of X in a single ucode primitive that 
does both the complex multiplies and the dot products (the MUDder of all 
primitives). The former is certainly the easier approach and might get us 
all the way there but the latter, if it can be cleverly coded, may perform 
better. In all cases, the ops don't change but at least the L2 gets some
breathing room. 

In any event, the so-called dot-product loop, whether it's separate or 
includes the complex multiply, still remains a difficult piece of code to 
fully optimize if we allow the number of virtual to physical users to vary 
as MUD (and Dr. Oates) demands. Using a LUT to acquire the index list and 
count of virtual users for a given physical user will tend to throttle the 
dot product code due to short vector lengths, funny address calculations, 
and "random" load and store patterns. The load isn't so bad since it's two 
cache lines no matter where it comes from. We may want to reorder Corr 
anyway just to ease the address arithmetic and DST logic. We could also 
simply store in the order we produce and leave it to the mpic processor to 
reorder (poor guy). As for the short vector count, I think this can be 
overcome with a clever primitive that "pauses" as little as possible between 
index lists but this will take some careful design. 

I think we should try for the "balanced" stripmine approach with essentially 
the same two primitives running in each processor. In the absence of 
dissenting views, I will continue modifying the C code to realize this 
structure. I'm still not sure where the Amp/fac_xx multiply(s)/shift(s) 
belong but for now I'll rid them entirely from the R-matrix functions that 
I'm preparing for ucoding. 

- Jon

CC: "Kenny, Jamie" <jfk@mc.com>

Mercury Computer Systems, Inc.
199 Riverneck Road
Chelmsford, MA 01824-2820

To: Wireless Communications Group
From: J. H. Oates 

Subject: Channel Estimation Date: October 20, 2000 

1. Introduction 

In the conventional RAKE receiver, channel amplitude¹ estimation is required for maximal
ratio combining the RAKE fingers. The BER performance is not strongly dependent on the
accuracy of the channel amplitude estimates. For Multi-User Detection (MUD) the channel
amplitude estimates are used for signal subtraction, and accuracy of the channel
amplitude estimates is more critical. In addition, the channel estimation error is larger 
when MUD is used since channel estimation is performed in a higher interference 
environment. This report investigates the accuracy of the conventional channel amplitude 
estimation techniques under elevated multiple access interference. The effect of channel 
amplitude estimation error on MUD efficiency is then assessed. The analysis presented 
here is intended to be a first-look. There are a number of ways to increase the channel 
amplitude estimation accuracy. A few of these are discussed below. 

Section 2 presents a model for the received signal and match-filter outputs. The effect of 
channel estimation error on MUD efficiency is addressed in section 3. In section 4 the 
accuracy of the conventional channel amplitude estimates is assessed. In section 5 
improved single-user methods are presented for channel amplitude estimation. Section 6 
presents a multi-user channel amplitude estimation method. Section 7 addresses the 
effect of uncancelled multipath on the MUD efficiency, which is used in section 8 to
assess the effect of dropping small amplitudes. It is shown that the overall MUD efficiency 
is improved by dropping small amplitudes. Conclusions are drawn in section 9. 

2. Signal Model and Matched-Filter Outputs 

The baseband received signal can be written

\[ r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_k[t - mT]\, b_k[m] + w[t] \tag{1} \]

¹ Amplitudes are complex and hence include magnitude and phase.

where t is the integer time sample index, T = N N_c is the data bit duration, N = 256 is the
short-code length, N_c is the number of samples per chip, w[t] is receiver noise, and where
\tilde{s}_k[t] is the channel-corrupted signature waveform for virtual user k. For L multipath
components the channel-corrupted signature waveform for virtual user k is modeled as

\[ \tilde{s}_k[t] = \sum_{p=1}^{L} a_{kp}\, s_k[t - \tau_{kp}] \tag{2} \]

where a_kp are the complex multipath amplitudes. Notice that a_kp = a_lp if k and l are two
virtual users corresponding to the same physical user. This is due to the fact that the
signal waveforms of all virtual users corresponding to the same physical user pass
through the same channel. For multiple antennas a_kp is a vector. For dual antennas, for
example, primary and diversity,



(3) 



The waveform s_k[t] is referred to as the signature waveform for the kth virtual user. This
waveform is generated by passing the spreading code sequence c_k[n] through a
pulse-shaping filter g[t]

\[ s_k[t] = \sum_{r=0}^{N-1} g[t - rN_c]\, c_k[r] \tag{4} \]

where N = 256 and g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine
pulse as opposed to a root-raised-cosine pulse, the received signal r[t] represents the
baseband signal after filtering by the matched chip filter. Note that for spreading factors
less than 256 some of the chips c_k[r] are zero.
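
Equation (4) can be illustrated with a short sketch. The roll-off factor of 0.22 is an assumption (it is the value commonly used for WCDMA chip filtering; this memo does not state it), and the function names and the four-chip example code are hypothetical.

```python
import math


def raised_cosine(t, n_c, rolloff=0.22):
    """Raised-cosine pulse g[t] at sample offset t, with n_c samples per chip."""
    x = t / n_c                      # offset measured in chips
    if x == 0:
        return 1.0
    denom = 1.0 - (2.0 * rolloff * x) ** 2
    if abs(denom) < 1e-12:
        # Limit value at x = +/- 1/(2*rolloff), where the closed form is 0/0.
        u = 1.0 / (2.0 * rolloff)
        return (math.pi / 4.0) * math.sin(math.pi * u) / (math.pi * u)
    sinc = math.sin(math.pi * x) / (math.pi * x)
    return sinc * math.cos(math.pi * rolloff * x) / denom


def signature_waveform(code, n_c, t):
    """s_k[t] = sum over r of g[t - r*n_c] * c_k[r], as in Equation (4)."""
    return sum(raised_cosine(t - r * n_c, n_c) * c for r, c in enumerate(code))


# Hypothetical 4-chip complex code, 4 samples per chip.
example_code = [1, -1, 1j, -1j]
on_chip_sample = signature_waveform(example_code, 4, 2 * 4)   # close to chip 2
half_chip_sample = signature_waveform(example_code, 4, 2)     # mixes chips
```

Because the raised-cosine pulse is zero at non-zero integer chip offsets, sampling exactly on a chip boundary returns that chip, while off-chip samples mix neighbouring chips.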

Combining Equations (1) through (4) gives 

\[ r[t] = \sum_{k=1}^{K_v} \sum_{m} \sum_{p=1}^{L} a_{kp}\, s_k[t - mT - \tau_{kp}]\, b_k[m] + w[t] \tag{5} \]

The output of the despreading operation for a single multipath component is the complex
statistic

\[ y_{lq}[m] \equiv \frac{1}{2N_l} \sum_{n} r[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^*[n]
            = \sum_{k=1}^{K_v} \sum_{m'} \sum_{p=1}^{L} a_{kp}\, C_{lkqp}[m']\, b_k[m - m'] + w_{lq}[m] \]

\[ C_{lkqp}[m'] \equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \tau_{kp}] \cdot c_l^*[n] \tag{6} \]

\[ w_{lq}[m] \equiv \frac{1}{2N_l} \sum_{n} w[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^*[n] \]

where \hat{\tau}_{lq} is the estimate of \tau_{lq}, and N_l is the (non-zero) length of code c_l[n]. The values
y_lq[m] are complex and are referred to as the pre-MRC matched-filter outputs. For multiple
antennas, r[t], w[t], y_lq[m] and w_lq[m] are column vectors.

The matched-filter output is then

\[ y_l[m] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{H} \cdot y_{lq}[m] \right\}
          = \sum_{m'} \sum_{k=1}^{K_v} r_{lk}[m']\, b_k[m - m'] + w_l[m] \]

\[ r_{lk}[m'] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{H}\, a_{kp}\, C_{lkqp}[m'] \right\} \tag{7} \]

where \hat{a}_{lq} is the estimate of a_{lq} and w_l[m] is the match-filtered receiver noise. The terms
for m' ≠ 0 result from asynchronous users.

3. Effect of Amplitude Estimation Error on MUD Efficiency

MUD efficiency is defined in terms of the ratio of the intra-cell interference with MUD (I_MUD)
to the intra-cell interference with the Matched Filter (MF), that is, the intra-cell interference
without MUD (I_MF):

\[ \beta_{MUD} \equiv 1 - \frac{I_{MUD}}{I_{MF}} \tag{8} \]

The total interference without MUD is I_MF + J, where J is the inter-cell interference.
Similarly the total interference with MUD is I_MUD + J. The ratio of inter-cell interference to
intra-cell interference without MUD is denoted f = J/I_MF. The increase in system capacity is
equal to the ratio of the total interference without MUD to the total interference with MUD,
which is (I_MF + J)/(I_MUD + J) = (I_MF + f I_MF)/(I_MUD + f I_MF) = (1 + f)/(1 − β_MUD + f). For f = 0.3 and
β_MUD = 0.7, MUD increases the system capacity by a factor of 1.3/(1 − 0.7 + 0.3) = 2.17.
Hence, if our goal is to double system capacity the MUD efficiency must be approximately
70% or greater.
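
The capacity relation above is easy to evaluate numerically. The sketch below implements (1 + f)/(1 − β_MUD + f) and its inverse; the function names are illustrative only.

```python
def capacity_gain(beta_mud, f):
    """Total interference without MUD divided by total interference with MUD."""
    return (1.0 + f) / (1.0 - beta_mud + f)


def efficiency_for_gain(gain, f):
    """MUD efficiency needed to reach a target capacity gain (inverse of above)."""
    return 1.0 + f - (1.0 + f) / gain


gain_at_70_percent = capacity_gain(beta_mud=0.7, f=0.3)
efficiency_to_double = efficiency_for_gain(gain=2.0, f=0.3)
```

With f = 0.3, exact doubling needs an efficiency of 0.65, which is consistent with the "approximately 70%" target stated above.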

In the following we estimate the loss in MUD efficiency, 1 − β_MUD, due to imperfect channel
estimation. For simplicity of presentation we consider approximately synchronous users. 

Recall that in a synchronous system the matched-filter outputs can be expressed as

\[ y_l = \sum_{k} r_{lk}\, b_k + w_l \tag{9} \]

and that the intra-cell interference is then

\[ I_{MF} = \sum_{k \ne l} E\{ r_{lk}^2 \} \tag{10} \]

The effect of channel amplitude errors is that the estimates of the R-matrix elements (\hat{r}_{lk})
are imperfect, which reduces the interference that is cancelled. When MUD is employed
with imperfect R-matrix estimates the detection statistic is

\[ y_l - \sum_{k \ne l} \hat{r}_{lk}\, b_k = r_{ll}\, b_l + \sum_{k \ne l} (r_{lk} - \hat{r}_{lk})\, b_k + w_l \tag{11} \]

where for the present case we have assumed that the bit estimates are perfect. With
MUD the intra-cell interference is

\[ I_{MUD} = \sum_{k \ne l} E\{ (r_{lk} - \hat{r}_{lk})^2 \} \]



Now from Equation (7), specialized for synchronous users 



r lk = Re 



L L | 

,=i P =i j 



1 9=1 />=> 



(12) 



(13) 



Hence the second-order statistics are 



* 9-P=l f'.p'" 



2/v, 9 , p= , 2/v, 9 , p= i 

= ^rlA:lEl[2 + 2Re( P/ ,p;)] 

K s l 2 }= £ f «-* I 2 } E l s I 2 }- I 2 ) 

P - P„ - Pe 



(14)

where we have assumed that the amplitude error is independent of the amplitude and we
have used



(15) 



The second expression is discussed below. We refer to E_k as the error amplitude for the
kth virtual user. The residual interference after MUD IC is



\[ I_{MUD} = \frac{1}{2N_l} \left[ (K-1)\,\alpha E_d^2 + K E_c^2 \right] \cdot 2 \cdot \left[ 1 + |\rho|^2 \right]
          \cong \frac{K}{2N_l} \left( \alpha + \beta_c^2 \right) E_d^2 \cdot 2 \cdot \left[ 1 + |\rho|^2 \right] \tag{16} \]

where all data channels have amplitude A. The error amplitude for the control channels is
denoted E_c and the error amplitude for the data channels is denoted E_d. All data channel
amplitudes are determined by scaling the corresponding control channel amplitudes by
1/β_c. Hence E_d = E_c/β_c.

Similarly we can show that 



(17) 



so that the matched-filter interference is

\[ I_{MF} = \frac{1}{2N_l} \left[ (K-1)\,\alpha A^2 + K \beta_c^2 A^2 \right] \cdot 2 \cdot \left[ 1 + |\rho|^2 \right]
         \cong \frac{K}{2N_l} \left( \alpha + \beta_c^2 \right) A^2 \cdot 2 \cdot \left[ 1 + |\rho|^2 \right] \tag{18} \]

Finally, the MUD efficiency is

\[ \beta_{MUD} = 1 - \frac{I_{MUD}}{I_{MF}} = 1 - \frac{E_d^2}{A^2} \tag{19} \]

4. Conventional Channel Estimation

The conventional channel amplitude estimate is given by 

1 M 

where 

m' 

lu^-^t^mV^m-m'} (21) 

In the above, b_l[m] represent the known pilot bits. (The lth virtual user is implicitly a control
channel.) The number M represents the number of pilot bits used to derive the channel
amplitude estimates. The channel amplitude estimate can be rewritten

L K, L 



(22) 



L *v L 



It is shown in the appendix that 

• »wl = *«•*»• "«* i 2 U (23) 



1 I'qVp' li q *kp 

Hence the variance of the estimate is 
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^ rMi^, i 2 }= i i 2 R + ££ £ K* ' 2 K +v < sE <« (24) 



The factor p £ simply reflects the fact that the off-diagonal elements are smaller than the 
diagonal elements due to partial correlations p*p between the antenna elements. In the 
Appendix it is also shown that 



(25) 



Now combining Equations (24) and (25) gives for the variance of the channel amplitude 
estimate 



N, pi MN l k= i p= i 



1 ' 



N, L ' MN,tl ' 

where we have used A_lp² = A_l²/L. The first term represents the variance due to a user's
own multipath interference. This term is small compared to the variance arising from the
total multiple-access interference. For simplicity we incorporate part of this term into the
second term and drop the remainder. The final term represents thermal noise and other-cell
interference. For now we assume that thermal noise is small. The interference arising
from other cells is assumed to be proportional to the same-cell interference, with a
constant of proportionality f = 0.35. With these assumptions we have

(27)



Notice that the magnitude of the error E_l is approximately the same for all users. Also, the
lth user is implicitly a control channel, and hence N_l = PG = 256. If the K_v virtual users
are all at the highest spreading factor, then in terms of the K = K_v/2 physical users we
have

\[ E_c^2 = (1 + f)\, \frac{L}{M \cdot PG} \left[ K \beta_c^2 A^2 + K \alpha A^2 \right] \tag{28} \]

where E_c is the magnitude of the channel amplitude error for a control channel, β_c is the
relative control channel amplitude, A is the amplitude for the data channels, and where α
is the activity factor for the data channels. Since the channel amplitudes for the data
channels are determined by scaling the amplitude of the corresponding control channel it
is evident that E_d = E_c/β_c. Hence,

\[ \frac{E_d^2}{A^2} = (1 + f)\, \frac{K L}{M \cdot PG} \left[ 1 + \frac{\alpha}{\beta_c^2} \right] \tag{29} \]



Given the parameters

    f    = 0.35
    K    = 128
    L    = 4
    M    = 18
    PG   = 256
    α    = 0.4
    β_c  = 0.7333

we get

\[ \frac{E_d}{A} = \sqrt{ (1 + 0.35)\, \frac{(128)(4)}{(18)(256)} \left[ 1 + \frac{0.4}{(0.7333)^2} \right] } = 0.51 \tag{30} \]

The number of pilot bits, M, is taken to be 18, which represents 6 bits per slot, with the
amplitudes averaged over 3 slots. The corresponding MUD efficiency is

\[ \beta_{MUD} = 1 - \left( \frac{E_d}{A} \right)^2 = 1 - (0.51)^2 = 0.74 \tag{31} \]
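
The numbers for the conventional estimate can be checked with a short sketch. It assumes the normalized error takes the form (E_d/A)² = (1 + f)·K·L/(M·PG)·[1 + α/β_c²], which reproduces the quoted values of 0.51 and 0.74; the function name is hypothetical.

```python
def conventional_error_ratio_sq(f, K, L, M, PG, alpha, beta_c):
    """(E_d / A)^2 for the conventional, pilot-based amplitude estimate."""
    return (1.0 + f) * K * L / (M * PG) * (1.0 + alpha / beta_c ** 2)


ratio_sq = conventional_error_ratio_sq(
    f=0.35, K=128, L=4, M=18, PG=256, alpha=0.4, beta_c=0.7333
)
error_ratio = ratio_sq ** 0.5          # E_d / A
beta_mud = 1.0 - ratio_sq              # MUD efficiency, 1 - (E_d / A)^2
```

Rounded to two decimals these give E_d/A = 0.51 and a MUD efficiency of 0.74.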
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5. Improved Channel Amplitude Estimates 

One method for significantly improving the channel amplitude estimates is to perform a
second estimate directly on the data channels after the initial data channel demodulation.
Performance is improved for two reasons. First, the entire slot can be used for integration.
Hence we have M = 3(10) = 30 bits. Second, the error is not scaled by 1/β_c since the
estimate is performed directly on the data channel. For this method we have

\[ \frac{E_d}{A} = \sqrt{ (1 + f)\, \frac{K L}{M \cdot PG} \left[ \beta_c^2 + \alpha \right] }
               = \sqrt{ (1 + 0.35)\, \frac{(128)(4)}{(30)(256)} \left[ (0.7333)^2 + 0.40 \right] } = 0.29 \tag{32} \]

and the corresponding MUD efficiency is

\[ \beta_{MUD} = 1 - \left( \frac{E_d}{A} \right)^2 = 1 - (0.29)^2 = 0.92 \tag{33} \]

Slightly better performance can be achieved by using both data and control channels.
This method can be performed either on the daughter card or on the modem card since it
is a single user method. The assumption is that the matched-filter BER is sufficiently
good.
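
A short sketch of the improved single-user estimate follows. It assumes the error takes the form (E_d/A)² = (1 + f)·K·L/(M·PG)·[β_c² + α], i.e. the control-channel expression without the 1/β_c scaling and with M = 30; this form reproduces the quoted 0.29 and 0.92. The function name is hypothetical.

```python
def improved_error_ratio_sq(f, K, L, M, PG, alpha, beta_c):
    """(E_d / A)^2 when the estimate is made directly on the data channel."""
    return (1.0 + f) * K * L / (M * PG) * (beta_c ** 2 + alpha)


improved_sq = improved_error_ratio_sq(
    f=0.35, K=128, L=4, M=30, PG=256, alpha=0.4, beta_c=0.7333
)
improved_ratio = improved_sq ** 0.5        # E_d / A
improved_beta_mud = 1.0 - improved_sq      # MUD efficiency
```

The gain over the conventional estimate comes from both the longer integration (30 bits instead of 18) and the absence of the 1/β_c² amplification.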

6. Multiuser Channel Amplitude Estimation 

Given the conventional channel estimates and the detected user bits it is possible to 
subtract the MAI which corrupts channel estimation. This method of channel estimation is 
referred to as multiuser channel estimation, as opposed to the conventional single-user 
estimation techniques. A simple multiuser channel estimation technique is presented 
below without analysis. Performance should be determined via simulation. 

From Equation (22) the conventional estimate is 

5„=IXv + {34) 

kp 

A multiuser estimate is obtained by subtracting the known interference among the channel
estimates

(35)



kp*lq 

where the (hopefully) improved multiuser channel estimate is denoted 5„ . The first term 
above is the actual channel amplitude. The second term is the residual interference and 
the last term represents thermal noise and other-cell interference, which is amplified by 
the multiuser interference subtraction. The extent of the amplification needs to be 
determined. 

7. Effect of Uncancelled Multipath Interference 

It is expected that a typical RAKE receiver will be capable of tracking up to approximately 
16 multipath components. Since the computational complexity of symbol-rate MUD is 
quadratic in the number of multipaths L, it is unlikely that MUD implementations will be
able to cancel all multipath interference. The effect of uncancelled multipath is assessed 
below. 

Suppose that the RAKE receiver processes L' multipath components, but that the MUD 
implementation cancels interference for L < L' components. From Equation (13) we have 



* 9=1 p=l 

and the variance is then 
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^1 p=1 *' V / 9 =l p=L+! 

2N f ^ L+ , p= , ?=L + ! p=L + l 

2/V, 9=t+ j p= i ^'V/ 9=t4 , p=L+l 

2JV, q=ip=Ul ,=L + lp=l 9 =i + lp=L + l J 



A* s IX /IX 

p=L+l / p=) 



(37) 



Note that β_x,k is the ratio of the uncancelled to cancelled interference for the kth user.
Similarly, we have

Now, neglecting the second order terms β_x,l β_x,k and averaging over the users, β_x = E{β_x,k},
we arrive at

\[ I_{MUD} \cong \frac{2 \left[ 1 + |\rho|^2 \right]}{2N_l}\, K \left( \alpha + \beta_c^2 \right) \left\{ E_d^2 + 2 \beta_x A^2 \right\} \]

\[ \beta_{MUD} = 1 - \frac{E_d^2 + 2 \beta_x A^2}{A^2 + 2 \beta_x A^2}
              = 1 - \frac{1}{1 + 2\beta_x} \left[ \frac{E_d^2}{A^2} + 2 \beta_x \right] \tag{39} \]

Note that β_x is the ratio of the uncancelled to cancelled interference.

In order to assess typical values for β_x, multipath models [1][2][3] were used to generate
random profiles. The models are based on data collected in four areas (A, B, C, and D) in
the San Francisco-Oakland bay area. Table 1 below summarizes the key results. The
table shows β_x versus the number of multipath components L.

Table 1. β_x versus the number of multipath components L.

            L = 8     L = 6     L = 4     L = 3     L = 2     L = 1
  Area A    0.0019    0.0064    0.0481    0.0961    0.2376    0.5819
  Area B    0.0012    0.0086    0.0404    0.1115    0.1416    0.5749
  Area C    0.0004    0.0054    0.0291    0.0948    0.1649    0.6603
  Area D    0.0039    0.0128    0.0430    0.0629    0.1435    0.4890


Suppose β_x = 0.05 and (E_d/A)² = 0.51² = 0.260. Without taking uncancelled multipath into
account we found β_MUD = 0.74. Taking uncancelled multipath into account we find

\[ \beta_{MUD} = 1 - \frac{1}{1 + 2\beta_x} \left[ \left( \frac{E_d}{A} \right)^2 + 2\beta_x \right]
              = 1 - \frac{1}{1 + 2(0.05)} \left[ (0.51)^2 + 2(0.05) \right] = 0.67 \tag{40} \]

where a worst-case β_x = 0.05 is used.
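
The calculation above can be wrapped in a small function to see how sensitive the efficiency is to the uncancelled-multipath ratio. This is a minimal sketch of the expression 1 − [(E_d/A)² + 2β_x]/(1 + 2β_x); the function and argument names are illustrative.

```python
def mud_efficiency(error_ratio_sq, beta_x):
    """MUD efficiency given (E_d/A)^2 and the uncancelled/cancelled ratio beta_x."""
    return 1.0 - (error_ratio_sq + 2.0 * beta_x) / (1.0 + 2.0 * beta_x)


with_uncancelled = mud_efficiency(0.51 ** 2, beta_x=0.05)
without_uncancelled = mud_efficiency(0.51 ** 2, beta_x=0.0)
```

Setting β_x to zero recovers the 0.74 figure, and β_x = 0.05 gives approximately 0.67.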
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8. Improved MUD Efficiency Due to Dropping Small Amplitudes 



If small amplitude multipath components are not included in the cancellation the MUD
efficiency is reduced slightly due to the additional uncancelled multipath interference, but
it is also increased because of the absence of the error resulting from the inclusion of these
small noisy estimates. The net effect is a substantial increase in the MUD efficiency. From
Equation (30) we have

\[ \frac{E_{d1}^2}{A^2} = (1 + f)\, \frac{K}{M \cdot PG} \left[ 1 + \frac{\alpha}{\beta_c^2} \right]
                       = (1 + 0.35)\, \frac{(128)}{(18)(256)} \left[ 1 + \frac{0.40}{(0.7333)^2} \right] = 0.065 \tag{41} \]

where E_d1 is the error due to a single multipath (i.e. L = 1). From Equation (37) it is
evident that if a particular multipath amplitude satisfies A_kp² < E_d1² then it is advantageous
not to incorporate this amplitude into the cancellation since the error is greater than the
amplitude. Table 2 shows the mean number of paths E{L} which satisfy A_kp² > E_d1² and the
ratio β_x of the uncancelled to cancelled interference if only these multipaths are cancelled.
The MUD efficiency is then calculated using



\[ \beta_{MUD} = 1 - \frac{1}{1 + 2\beta_x} \left[ E\{L\}\, \frac{E_{d1}^2}{A^2} + 2\beta_x \right] \tag{42} \]
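
The path-selection rule described above can be sketched as follows. The sketch assumes path powers are normalized so that a user's total power is 1, so the threshold is the normalized single-path error 0.065, and it assumes β_x is the summed power of the dropped paths divided by the summed power of the cancelled paths. The profile values and function names are hypothetical.

```python
def select_paths(path_powers, error_power):
    """Split multipath powers into cancelled (kept) and dropped sets."""
    kept = [p for p in path_powers if p > error_power]
    dropped = [p for p in path_powers if p <= error_power]
    return kept, dropped


def uncancelled_ratio(kept, dropped):
    """Ratio of uncancelled to cancelled multipath power."""
    if not kept:
        raise ValueError("no path exceeds the error power; nothing to cancel")
    return sum(dropped) / sum(kept)


# Hypothetical normalized power profile for one user (sums to 1).
profile = [0.60, 0.25, 0.10, 0.05]
kept, dropped = select_paths(profile, error_power=0.065)
beta_x = uncancelled_ratio(kept, dropped)
num_cancelled = len(kept)
```

For this made-up profile three paths are cancelled and the weakest one is dropped; averaging `num_cancelled` and `beta_x` over many random profiles would correspond to the E{L} and β_x columns reported below.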



Table 2. Improved MUD efficiency (β_MUD) due to dropping small amplitudes.

            E{L}      β_x       β_MUD
  Area A    2.0300    0.0714    0.7638
  Area B    2.4660    0.0691    0.7482
  Area C    2.2970    0.0680    0.7564
  Area D    2.0690    0.0625    0.7748
  Mean      2.2155    0.0678    0.7608

9. Conclusions 

This report represents a first-look at channel estimation and the effect of errors on the 
MUD efficiency. Only the case where all users are at the highest spreading factor has 
been examined. The initial results indicate that if the conventional channel estimates are 
used the MUD efficiency drops to 74% due to estimation errors. If the effect of 
uncancelled multipath interference is also considered the MUD efficiency drops down to 
67%. If small amplitude multipath components are not included in the cancellation the
MUD efficiency is reduced slightly due to the additional uncancelled multipath
interference, but it is also increased because of the absence of the error resulting from the
inclusion of these small noisy estimates. The net effect is a substantial increase in the
MUD efficiency, which is increased to 76%. The actual MUD efficiency will, of course, be
less due to other factors which degrade efficiency. If an improved single-user channel 
estimation is used the MUD efficiency can be increased to 92%. This improved method 
requires knowledge of the pre-MRC matched-filter outputs. It is perhaps possible to 
further increase the MUD efficiency by employing multiuser channel estimation. These 
techniques also require knowledge of the pre-MRC matched-filter outputs. The above 
referenced MUD efficiency numbers are based on 128 users processed by the 
basestation. If fewer users are allowed access to the system in order to increase range
the MUD efficiency is unchanged since the total interference and noise remains
unchanged.

Appendix A

In order to estimate the variance of the channel amplitude estimate we need the second 
order statistics 




C; Vp .[m']} E{l lk [my /,,.[m']} 




(A1) 



where we have used 




which is derived in Appendix B assuming random codes. In order to evaluate E{I_lk[m']}
we consider two cases: 1) k = l, and 2) k ≠ l. For k = l we have
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/W lm=l «=) J 



= 5 m . 0 + d-5 mT) )-i- 
m( \ M ) M 



whereas for k ≠ l we have



["»' l}= 77T 4]L £ M •*,[«]•*»[«-»']■*»[»-»'] 



J_ 



(A3) 



| MM 
m=\ n=\ 



Hence, combining Equations (A3) and (A4) we have 

4',W=V^-^)+^ (A5) 
Equation (A1) then becomes 

Now specializing Equation (A6) to the case where k = l

The above expression is further simplified if we assume that users are approximately
synchronous so that N_lkqp[m'] ≈ N_l, which gives

e\H w rl, s ± (A8) 



Similarly, specializing Equation (A6) to the case where k ≠ l



E \ H \A = _L (A9) 

» '*> 1 *M MNj 



Appendix B 

In Appendix A we used the approximation 

under the restriction that lq ≠ kp. We show here that this expression is exactly true for
chip-synchronous users, and that the approximation is reasonably valid for chip- 
asynchronous users, particularly when differences in delay lag are greater than about 2 
chips. The analysis is based on random user codes. 

The user correlations can be explicitly related to the code correlations as follows 
C lkqp [m] = -J- XX *K'" - j)N c +mT + r lq -x kp \ c\\i\- c k [j] 

(B2) 

c„ [t ] = tJj-XS sw - j)n c +t]c;[i] • c k u] 

Consider two cases: 1) l ≠ k, and 2) l = k.

Case 1

When l ≠ k the second-order statistics become

4/V / /V / . y,^- 

-■mlf**^"*-^' «B3, 

g ij [r] = gl(i"j)N c +r] 
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where we have used the assumption of random user codes, independent among the 
users. Note also that the summation over i is over the range where c_l[i] is non-zero, and
similarly the summation over j is over the range where c_k[j] is non-zero.

Case 2 

Now consider case 2 where l = k



5,, 



47V ,/V,. t j V y 



When l ≠ l' we have



4/V, ijrr 

=^r|l«vM •«r,[T , ]-4;n-fi[n-c,[y]-c;[/]} 

+ « f/ [T'i- £fe[ii-c ( [i']-c,u-]- c;t/]}| 

= ^{l^Wg iT [r-].25,.25 £ . 

+ I«W«, 7 [TT2£(: ( [r]c;[/]} 

»=>.'">' J 

= ^r|Z 6« W- - 25, • 25,. - £j,[t]- g r/ [T'] • 25, • 25,. 

I^.iTi^^T'j-aft/in c;t/]} 



+ 



(B4) 



= 8 n .glt]g[T'] 

whereas when l = l' we have

Efc„ [T] • C,;.[T']}= -L X *, M- « r/I*'] ■ fifcW • c,[f] • c, [y] • c;.[/]} 

4 "i «>"/ 



(B6a) 



(B6b) 
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= ^{l* 9 [T]* if [T']-2-2-X*[T]«[T , ]-2-2 

4/V, [ a 

+ I^[T]«,./[T']-2.25 r/ } 

= %|lg, j mg lJ [T , ]-W / g[T]«[T > ]+yV | ^[T]g[T']l 
= 5 /r g[T]«[T , ] + ^|£g,[T]g,[T']-/V < g[T]g[T']J 



Hence combining Equations (B5) and (B6c) we have 



£{c„[t] C;,.[t']}= 8 n .g[T] g[T']+ 8 > r d * 



and combining cases for l ≠ k and l = k we have



£{c,jT]C; i .[T , ]}=5 rt -5 ( ,.g[T]g[r'] 



rl, ij ,y 



(B6c) 



(B7) 



(B8) 



= g. ■ g rt .g[T]- g[T']- gft • ^ ^ g[T]- riT'l + j^S g,[f ] 

The above expression can be used to determine the second-order statistics for the 
general case of symbol-asynchronous and chip-asynchronous users with arbitrary 
spreading factors. In what follows we will be interested in approximating the above 
expression so as to get simple but meaningful results. In order to simplify the expressions 
we consider users all at the highest spreading factor, and we assume that certain small 
values are zero. 

To assess the accuracy of channel estimation we need to determine the second order 
statistics 



\[ E\{c_{lkqp}[m] \cdot c^*_{l'k'q'p'}[m']\} = E\{c_{lk}[\tau_{lkqp}[m]] \cdot c^*_{l'k'}[\tau_{l'k'q'p'}[m']]\} \qquad (B9) \]

with lq ≠ kp. The function g[τ]g[τ'] in Equation (B8) above is small unless both τ and τ' are 
close to zero, and for the chip-asynchronous case the function is exactly zero unless 
both τ and τ' are equal to zero. Since for lq ≠ kp the probability that τ_lkqp[m] is close to 
zero is small, a good approximation is to assume that these functions are zero. The third 
term can be written 



£{C ( jT].C;.[Tl}=^|i-Xg,[T]^ y [T']j 



(B10) 



The double summation in the brackets 



(B11) 



is plotted in Figure B1 for N_l = N_k = 256 versus τ - τ' for (τ + τ')/2 = 0. 

Figure B1. Plot of S_lk[τ, τ'] for N_l = N_k = 256 
versus τ - τ' for (τ + τ')/2 = 0. 

The sharp localization around τ - τ' = 0 is valid for all values of (τ + τ')/2, except that for 
(τ + τ')/2 large the peak value drops off due to the partial overlap of the codes. Hence for 
delay lag differences τ - τ' greater than about 2 chips a good approximation is 



(B12) 



This approximation then gives 



E{c fk [t] C[r']}= • 5,. • S„[t,t] 



(B13) 
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which implies 



Ete^W-C'nwW}^'*, A,- ■8 mm - S ll [r,x) (B14) 
provided the delay spread is less than a symbol period. Now it can be shown that 

SA^ P lrnl,r lk Jm']] = ^-^gl[T lk Jm'}} 

' * (B15) 

where N lkqp [m'] is the overlap between the user codes. Our final result is then 

E\p ik Jm) ■ C;,,,[ml}z ±- ■ 8, ■ 8 U . ■ 8„. ■ 8 p , • 5 W • (B1 6) 
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To: Wireless Communications Group 



From: J. H. Oates 

Subject: MUD interface to modem Date: January 3, 2001 



1. Multi-User Signal Model 

The Rake receiver operation described in the next section is based on a signal model. The 
MUD algorithm and implementation are based on the same model. This model is 
described below. 

Figure 1 shows the uplink complex spreading for the Dedicated Physical Data 
CHannels (DPDCHs) and the Dedicated Physical Control CHannel (DPCCH). There can 
be from 1 to 6 DPDCHs, denoted DPDCH_k, for k from 1 to 6. If there is more than one 
DPDCH, then the spreading factor for all DPDCHs must be equal to 4. For a single 
DPDCH (DPDCH_1) the spreading factor can vary from 4 to 256. The data bits for channel 
DPDCH_1 are spread by channelization code c_d,1 = C_ch,SF,SF/4, where SF is the DPDCH 
spreading factor. These channelization codes are referred to as Orthogonal Variable 
Spreading Factor (OVSF) codes. They are equivalent to Hadamard codes, except for their 
ordering. When there are multiple DPDCHs, the dedicated channels DPDCH_k, for k from 
1 to 6, are spread by channelization codes c_d,k = C_ch,4,n, where the relationship between n 
and k is given in Table 1. 



Table 1. Relationship between n and k. 

    n      k 
    1      1, 2 
    3      3, 4 
    2      5, 6 



The data bits for the DPCCH are spread by code c_c = C_ch,256,0. The spreading factor for the 
DPCCH is always equal to 256. The multipliers β_c and β_d are constants used to select the 
relative amplitudes of the control and data channels. At least one of these constants must 
be equal to 1 for any given symbol period m. 

Figure 1. Uplink complex spreading of DPDCHs and DPCCH 
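The channelization codes named above can be illustrated with a short Python sketch. It builds OVSF codes with the usual tree recursion, in which each code of length SF/2 spawns two children of length SF. The function name `ovsf`, the helper `correlate`, and the list-of-chips representation are choices made for this example only and do not come from the memo; the index convention is assumed to match the C_ch,SF,k notation used here.

```python
def ovsf(sf, k):
    """Return OVSF code C_ch,sf,k as a list of +1/-1 chips (sf a power of two)."""
    if sf < 1 or sf & (sf - 1):
        raise ValueError("sf must be a power of two")
    if not 0 <= k < sf:
        raise ValueError("code index out of range")
    if sf == 1:
        return [1]
    parent = ovsf(sf // 2, k // 2)        # code one level up the tree
    if k % 2 == 0:
        return parent + parent            # even child repeats the parent
    return parent + [-c for c in parent]  # odd child appends the negated parent


def correlate(a, b):
    """Zero-lag correlation of two equal-length chip sequences."""
    return sum(x * y for x, y in zip(a, b))


# Control channel code C_ch,256,0 and a single DPDCH at SF = 64 (C_ch,64,16).
c_c = ovsf(256, 0)
c_d1 = ovsf(64, 64 // 4)

# Multi-code case: DPDCH_k uses C_ch,4,n with n taken from Table 1.
table_1 = {1: 1, 2: 1, 3: 3, 4: 3, 5: 2, 6: 2}
multi_codes = {k: ovsf(4, n) for k, n in table_1.items()}
```

Two DPDCHs that share a value of n share the same code; in Figure 1 they are carried on different branches of the complex spreader.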

The uplink spreading for any one of the seven Dedicated CHannels (DCHs) above can be 
represented as shown in Figure 2. 




Figure 2. A second representation of the uplink spreading 
for any one of the seven Dedicated CHannels (DCHs). 



where the code c[n] is given by 

C 



d«] = 



a^oM'/M"!. DPCCH 

M ■$*[»]. DPDCH, 

C rt .a6.«W'A["]. DPDCH 2 

CabwbW-^W, DPDCH 3 



(1) 



'c/i.256.192 



W-iS^W. DPDCH 4 
CAawaW-S^W, DPDCH 5 
[n]jSJn], DPDCH t 



'di. 256, 128 
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and 



\[ \beta = \begin{cases} \beta_c, & \text{DPCCH} \\ \beta_d, & \text{DPDCH}_k \end{cases} \qquad (2) \]



For a DCH with a spreading factor less than 256 there are J = 256/SF data bits 
transmitted during a single 256-chip symbol period (i.e. 1/15 ms). From a signal model 
perspective, the J data bits transmitted per symbol period can be viewed as arising from J 
virtual users, each transmitting a single bit per symbol period. The idea is illustrated in 
Figure 3. 



[Figure 3: the bit stream b is split into J streams b_j[m] = b[j + mJ], j = 0, ..., J-1, 
each upsampled by 256 and spread by its own code c_j[n].] 

Figure 3. Transforming a single user with bit rate J bits per symbol period 
into J virtual users, each with bit rate 1 bit per symbol period. 

The codes for these virtual users are formed by extracting SF elements at a time out of 
the DCH code sequence to form J new codes. Each of the J codes is of length 256 chips, 
but with only SF non-zero chips. That is, 



\[ c_j[n] \equiv \begin{cases} c[n], & j \cdot SF \le n < (j+1) \cdot SF \\ 0, & \text{otherwise} \end{cases} \qquad (3) \]



This code-partitioning concept is illustrated in Figure 4 for the case SF = 64 so that 
J = 256/SF = 4 codes are derived from the one DCH code. 

Figure 4. Code partitioning concept illustrated for the case SF = 64, 
whereby J = 256/SF = 4 codes are derived from a single DCH code. 
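The partitioning of Equation (3) and the bit de-interleaving b_j[m] = b[j + mJ] can be sketched in a few lines of Python. The function names and the placeholder chip sequence below are illustrative assumptions; only the index arithmetic is taken from the text.

```python
N = 256  # chips per symbol period


def partition_code(code, sf):
    """Split a 256-chip DCH code into J = 256/sf virtual-user codes (Equation 3).

    Each returned code has length 256 but only sf non-zero chips.
    """
    if len(code) != N or N % sf:
        raise ValueError("code must have 256 chips and sf must divide 256")
    j_total = N // sf
    return [
        [code[n] if j * sf <= n < (j + 1) * sf else 0 for n in range(N)]
        for j in range(j_total)
    ]


def split_bits(bits, j_total):
    """De-interleave a bit stream into J virtual-user streams, b_j[m] = b[j + m*J]."""
    return [bits[j::j_total] for j in range(j_total)]


# Placeholder +1/-1 chip sequence standing in for a real DCH code.
dch_code = [1 if (n * 7) % 3 else -1 for n in range(N)]
virtual_codes = partition_code(dch_code, sf=64)  # J = 4, as in Figure 4
```

Summing the J virtual-user codes chip by chip returns the original DCH code, which is a quick check that the partition loses nothing.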

The control channel can also be viewed as a virtual user. Hence, for a given physical user 
with spreading factor SF there are 1 + 256·N_D/SF virtual users, where N_D is the number of 
DPDCHs. (Recall that for N_D > 1, SF = 4.) 

It turns out to be convenient to use a double indexing scheme to identify virtual users. Let 
paired indices kj represent the jth virtual user associated with the kth dedicated channel. 
Index j ranges over 0 <= j < J_k = 256/SF_k, where SF_k is the spreading factor for the kth 
dedicated channel. For the remainder of this section the spreading factors SF_k are 
assumed to be constant. In section 3 the equations are reformulated to allow for symbol- 
by-symbol changes in the spreading factor. 

The transmitted signal for virtual user kj can be written 

\[ x_{kj}[t] = \beta_k \sum_m v_{kj}[t - mT]\, b_{kj}[m] \qquad (4) \]

where t is the integer time sample index, T = N·N_c is the data bit duration, N = 256 is the 
short-code length, N_c is the number of samples per chip, b_kj[m] are the data bits, and 
where v_kj[t] is the transmit signature waveform for virtual user kj. This waveform is 
generated by passing the spread code sequence c_kj[n] through a root-raised-cosine pulse- 
shaping filter h[t] 

\[ v_{kj}[t] = \sum_p h[t - pN_c]\, c_{kj}[p] \qquad (5) \]

Note that β_k = β_c if the kjth virtual user corresponds to a control channel. Otherwise 
β_k = β_d. 

The total number of virtual users is denoted 

\[ K_v \equiv \sum_{k=1}^{K_D} J_k \qquad (6) \]
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where K D is the total number of dedicated channels. The baseband received signal after 
root-raised-cosine matched-filtering can be written 

\[ r[t] = \sum_{k=1}^{K_D} \sum_{j=0}^{J_k-1} \sum_m \tilde{s}_{kj}[t - mT]\, b_{kj}[m] + w[t] \qquad (7) \]

where w[t] is receiver noise with a raised-cosine power spectral density, and where s̃_kj[t] 
is the channel-corrupted signature waveform for virtual user kj. For L multipath 
components the channel-corrupted signature waveform for virtual user kj is modeled as 

\[ \tilde{s}_{kj}[t] = \sum_{p=1}^{L} a_{kp}\, s_{kj}[t - \tau_{kp}] \qquad (8) \]

where a_kp are the complex multipath amplitudes. The amplitude ratios β_k are incorporated 
into the amplitudes a_kp. Notice that if k and l are two dedicated channels corresponding to 
the same physical user then, aside from scaling by β_k and β_l, a_kp and a_lp are equal. 
This is due to the fact that the signal waveforms of all virtual users corresponding to the 
same physical user pass through the same channel. The waveform s_kj[t] is referred to as 
the signature waveform for the kjth virtual user. This waveform is generated by passing 
the spread code sequence c_kj[n] through a raised-cosine pulse-shaping filter g[t] 

\[ s_{kj}[t] = \sum_p g[t - pN_c]\, c_{kj}[p] \qquad (9) \]

Note that for spreading factors less than 256 some of the chips c_kj[p] are zero. 
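The pulse-shaping sum and the multipath model can be sketched as follows. This is a minimal illustration, assuming a causal pulse stored as a short list of taps (index 0 is the first tap, so the result is delayed relative to a zero-centred g[t]) and integer-sample path delays. The function names and the three-tap pulse are hypothetical; a real system would use raised-cosine taps.

```python
def signature_waveform(code, pulse, n_c):
    """s[t] = sum_p pulse[t - p*n_c] * code[p], the form of Equations (5) and (9)."""
    length = (len(code) - 1) * n_c + len(pulse)
    s = [0.0] * length
    for p, chip in enumerate(code):
        if chip == 0:
            continue                      # zero chips of a virtual-user code add nothing
        for i, tap in enumerate(pulse):
            s[p * n_c + i] += tap * chip
    return s


def channel_corrupted(s, amplitudes, lags):
    """s~[t] = sum_p a_p * s[t - tau_p], the multipath sum, with integer lags."""
    out = [0j] * (len(s) + max(lags))
    for a, tau in zip(amplitudes, lags):
        for t, value in enumerate(s):
            out[t + tau] += a * value
    return out


# Toy example: two chips, two samples per chip, triangular stand-in pulse.
s = signature_waveform([1, -1], pulse=[0.5, 1.0, 0.5], n_c=2)
s_tilde = channel_corrupted(s, amplitudes=[1 + 0j, 0.5j], lags=[0, 1])
```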
2. Rake Receiver Operation 

This section describes the operation of a typical Rake receiver. Figure 5 shows a 
representation of the received antenna data that is delivered to the Rake receivers of all 
users. 

[Figure 5: timeline of the receive buffer, marking one symbol (i.e. 256 chips), the start of 
the buffer, the start of frame for user l and for user k, and the start of the MUD 
processing frame.] 

Figure 5. Received antenna data delivered to the Rake receivers of all users. 

The figure shows the received signals corresponding to users l and k. These signals are 
combined in free space so that the receiver gets one composite signal, which we denote 
r[t]. The buffer length is assumed to be an integral number of frames so that 
delay lag values T_lq are approximately constant with each new filling of the buffer. For 
each finger of each user there is a delay lag value T_lq indicating the start of frame for the 
qth multipath of the lth user. Lag values T_lq are assumed to be constant over a frame, but 
are allowed to change from frame to frame in response to the delay locked loop operation 
and in response to new searcher-receiver sweeps where new delay lags are found. The 
lower case values τ_lq = T_lq mod 256N_c denote the symbol-period offset relative to the start 
of an internal symbol period reference clock. Notice that the user spreading factors 
change on user frame boundaries. Since users are asynchronous it is impossible to have 
a MUD processing frame that corresponds to all user frame boundaries. Hence the MUD 
processing frame is matched as closely as possible to the user frame boundaries, but does 
not necessarily correspond precisely to any user's frame boundary. Consequently there 
will be spreading factor changes that occur during a MUD processing frame. Handling 
these mid-frame changes is the subject of section 3 below. 
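As a small worked example of the relation τ_lq = T_lq mod 256N_c, the Python sketch below splits an absolute lag, measured in samples, into a whole number of symbol periods and the offset within a symbol. The function name and the choice to return the symbol count alongside the offset are assumptions made for illustration.

```python
def split_lag(t_lag, n_c, chips_per_symbol=256):
    """Return (whole symbol periods, offset within the symbol) for a lag in samples."""
    symbol_len = chips_per_symbol * n_c   # samples per symbol period
    return divmod(t_lag, symbol_len)


# With 4 samples per chip a symbol spans 1024 samples, so a lag of 2085 samples
# is two full symbols plus an offset of 37 samples.
symbols, tau = split_lag(2 * 256 * 4 + 37, n_c=4)
```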

The received signal above, which has been match-filtered to the chip pulse, must next be 
match-filtered by the user code-sequence filter. Since the spreading factor for the 
DPDCHs is not known, the Rake receiver performs an initial 4-chip despreading over all 
DPDCHs. The Fast Hadamard Transformation (FHT) can be used here to reduce the 
number of operations. The detection statistics for the multiple fingers and multiple 
antennas are maximal-ratio combined. Since the DPCCH is always spread with a 
spreading factor of 256 the DPCCH can be entirely despread during each symbol period. 
TFCI bits are extracted each slot from the DPCCH. After an entire frame is processed the 
TFCI is decoded and the spreading factor for that frame is determined. After spreading 
factor determination the final DPDCH despreading is performed. The resulting detection 
statistics are denoted here as y_kj[m], the matched-filter output for the kjth virtual user for 
the mth symbol period. Since there are K_v codes, there are K_v such detection statistics, 
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which are collected into a column vector y[m] for the mth symbol period. The matched- 
filter output yn[m], for the Mh virtual user can be written 



where â_lq is the estimate of a_lq, τ̂_lq is the estimate of τ_lq, and N_l is the (non-zero) length 
of codes c_li[n] (i.e., the spreading factor for the lth dedicated channel). The intermediate 
result y_liq[m] represents the despread signal at the qth lag, and is here referred to as the pre- 
MRC matched-filter output. When multiple antennas are employed, r[t], y_liq[m] and â_lq are 
column vectors with one complex element per antenna. 

The matched-filter detector estimates the transmitted data bits as b̂_li[m] = sign{y_li[m]}. 
Multiuser detection is considered in the next section. 
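The Rake operations described in this section (per-finger despreading, maximal-ratio combining, and a sign decision) can be sketched in Python. This is a hedged illustration, not the memo's Equation (10): it assumes a single antenna, integer-sample lags, perfect channel estimates, and a 1/(2N) normalization of the correlation. All function names are hypothetical, and the demo uses an 8-chip symbol for brevity.

```python
def despread(r, code, lag, m, n_c, n_chips):
    """Pre-MRC output of one finger: correlate r with the code at the given lag."""
    t0 = lag + m * n_chips * n_c
    nonzero = sum(1 for c in code if c != 0)
    acc = 0j
    for n, c in enumerate(code):
        if c:
            acc += r[t0 + n * n_c] * complex(c).conjugate()
    return acc / (2 * nonzero)


def rake_output(r, code, amps, lags, m, n_c, n_chips=256):
    """Weight each finger by the conjugate amplitude, sum, and keep the real part."""
    total = 0j
    for a, lag in zip(amps, lags):
        total += a.conjugate() * despread(r, code, lag, m, n_c, n_chips)
    return total.real


def hard_decision(y):
    """Matched-filter detector: sign of the detection statistic."""
    return 1 if y >= 0 else -1


# Demo: one user, two paths, noise-free, one sample per chip.
code = [1, -1, 1, 1, -1, 1, -1, -1]
bits = [-1, 1]
amps = [1 + 0j, 0.5j]
lags = [0, 3]
r = [0j] * (len(bits) * 8 + max(lags))
for a, lag in zip(amps, lags):
    for m, b in enumerate(bits):
        for n, c in enumerate(code):
            r[lag + m * 8 + n] += a * b * c

y = [rake_output(r, code, amps, lags, m, n_c=1, n_chips=8) for m in range(2)]
decisions = [hard_decision(v) for v in y]
```

In this demo the two path amplitudes are in phase quadrature, so the cross-path terms fall entirely in the imaginary part and the real part reduces to the combined path energy times the bit, divided by two.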

3. Multiuser Detection Equations and Asynchronous Processing 

As shown in Figure 5, a MUD processing interval must necessarily be asynchronous with 
most users' frame boundaries since the users are asynchronous. Because of this, 
spreading factors will change during a MUD processing frame. When the spreading factor 
changes during the processing frame the MUD equations are modified. These 
modifications are considered in this section. 

The modem delivers matched-filter data to the MUD function on a frame-by-frame basis. 
Let N_P[r] represent the number of physical users accessing the system during frame r. For 
each frame the following data is received for physical users p = 1 to N_P[r] and each 
dedicated channel l: 

• Number of DPDCHs, N_D,p 

• Spreading factor, SF_l 

• Amplitude ratios β_d and β_c 

• Slot format 

• Channel amplitude estimates â_lq 

• Channel lag estimates T_lq 

• Matched-filter outputs f_li[m] for all DCHs 

• Code numbers 

• Gap information for compressed mode 

Matched-filter outputs f_li[m] correspond to the matched-filter outputs y_li[m]. If the lth 
dedicated channel is a DPCCH then matched-filter outputs are only received for the TPC, 
TFCI and FBI bits. The f_li[m] values are mapped to the y_li[m] values as described below. 
The mapping accounts for the frame offsets between the various users. The amount of 
matched-filter data received per physical user depends on the DPDCH spreading factor. 




(10) 



For each dedicated channel a symbol offset m_l is determined according to 
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m "|i| T + 



iv (256/V c ) 



(11) 



where div denotes integer division (i.e. with truncation). The symbol offset represents the 
fact that the users and hence the frame data are asynchronous. The y-data used for 
interference cancellation is derived from the frame data using 



\[ y_{li}[m] = f_{li}[m - m_l] \qquad (12) \]



Figure 6 shows an example mapping of user data frames to MUD processing frames. To 
illustrate concepts the frames are each 16 symbol periods long rather than the actual 150 
symbols for WCDMA. The height of the blocks represents the number of virtual users per 
physical user. For physical users 1 and 4 the spreading factor changes in going from data 
frame 1 to data frame 2. As shown in the figure this results in spreading factor changes 
within the MUD processing frame. The MUD function is designed to calculate the C- 
matrix once per frame. Hence mid-frame changes to user spreading factors pose a 
problem which requires special treatment. It turns out, and will be shown below, that mid- 
frame changes to the spreading factor can be accommodated by performing modified 
calculations based on the minimum spreading factor over the MUD processing frame. 



[Figure 6: user data frames (Frame 1, Frame 2) for physical users 1 to 4, shown against 
the MUD processing frames (Frame 1, Frame 2).] 

Figure 6. Mapping of user data frames to MUD processing frames. 



First we develop the MUD matrix signal model which allows user spreading factors to 
change on a symbol-by-symbol basis. We then show how we can perform the processing 
based on the minimum user spreading factors over the MUD processing frame. 

Let us reformulate the signal model presented in section 1 so as to allow spreading 
factors to change every symbol period. For every DCH k, there are J_k[m] virtual users, 
where index m is the symbol period index. The number of virtual users J_k[m] is 

\[ J_k[m] \equiv \frac{256}{SF_k[m]} \qquad (13) \]

where SF_k[m] is the spreading factor for the kth dedicated channel during the mth symbol 
period. The signature waveform for the jth virtual user of J_k[m] total belonging to the kth 
DCH over the mth symbol period can be written 

\[ s_{kj,m}[t] \equiv \sum_p g[t - pN_c]\, c_{kj,m}[p] \qquad (14) \]

where the codes and hence the signature waveforms now include the symbol-period index 
m to account for symbol-by-symbol spreading factor changes. The channel-corrupted 
signature waveform is then 

\[ \tilde{s}_{kj,m}[t] = \sum_{p=1}^{L} a_{kp}\, s_{kj,m}[t - \tau_{kp}] \qquad (15) \]

and thus the received signal corresponding to K_D dedicated channels is 

\[ r[t] = \sum_{k=1}^{K_D} \sum_m \sum_{j=0}^{J_k[m]-1} \tilde{s}_{kj,m}[t - mT]\, b_{kj}[m] + w[t] \qquad (16) \]



The MUD matrix signal model proceeds from substituting the received signal r[t] from 
Equation (16) into Equation (10) for the matched-filter outputs 

„ k=\ j=o q=\ . z/Vj L/nj r j 



K 4 A In]-! 
n k=l j=0 

(17) 



9=1 p=l 



where η_li[m] is the match-filtered receiver noise and N_l[m] = SF_l[m]. The terms for m' ≠ 0 
result from asynchronous users. 

The delay lags τ_lq for a given DCH l will under most circumstances be grouped within a 
range of 4 to 8 µs. Under extreme conditions the delay spread will be as high as 20 
µs. In any event, let τ_l represent the mean delay lag τ_lq over index q. According to 
Equation (10) above, the matched-filter detection statistic y_li[0] is the result found by 
correlating the received signal starting roughly at delay lag τ_l, where τ_l is approximately in 
the range 0 to 256N_c. If τ_l moves significantly outside this range an adjustment in the 
symbol period alignment will need to be made to restore τ_l to within the desired 
range. More will be said about this below. Along the same lines, the detection statistic 
y_li[m] is the result found by correlating the received signal starting roughly at delay lag τ_l + 
mT. 

For efficient MUD processing it is important for the C-matrices to be constant over a 10 
ms MUD processing frame. We now describe a method which operates on constant C- 
matrices. Handling changes to user spreading factors is relegated to the IC portion of the 
MUD processing. Let us define 

\[ J_k \equiv \max_m J_k[m] \qquad (18) \]

where the maximization is over symbol periods m that contribute to the current MUD 
processing frame. This includes not only symbol periods that fall within the MUD 
processing frame, but in addition a few symbol periods on either side due to 
asynchronous users. Note that the minimum spreading factor for the kth DCH is SF_k = 
256/J_k. Now define the DCH contraction factor for the mth symbol period as 

\[ C_k[m] \equiv \frac{J_k}{J_k[m]} \qquad (19) \]



The DCH codes for a given symbol period can be expressed as a sum of the DCH codes 
corresponding to the minimum spreading factor. For the kth DCH there are at most J_k 
virtual users corresponding to the minimum spreading factor. Let the codes for these 
users be denoted c_kj[r], 0 <= j < J_k. The codes for the mth symbol period, where there 
might be fewer virtual users, are denoted c_kj,m[r], 0 <= j < J_k[m], where 

\[ c_{kj,m}[r] = \sum_{j'=jC_k[m]}^{(j+1)C_k[m]-1} c_{kj'}[r] \qquad (20) \]
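The contraction factor and the code sum above can be checked with a short Python sketch. It assumes the minimum-spreading-factor codes are stored as zero-padded lists of equal length, one per virtual user; the function names and the 16-chip toy symbol are illustrative only.

```python
def contraction_factor(j_max, j_m):
    """C_k[m] = J_k / J_k[m]; both counts are powers of two, so this is an integer."""
    if j_max % j_m:
        raise ValueError("J_k[m] must divide J_k")
    return j_max // j_m


def combine_codes(min_sf_codes, j_m):
    """Sum groups of C_k[m] consecutive minimum-SF codes to get the codes for symbol m."""
    c = contraction_factor(len(min_sf_codes), j_m)
    n = len(min_sf_codes[0])
    return [
        [sum(min_sf_codes[jp][r] for jp in range(j * c, (j + 1) * c)) for r in range(n)]
        for j in range(j_m)
    ]


# Toy example with a 16-chip symbol: minimum SF = 4 gives J_k = 4 codes.
symbol = [1, -1, 1, 1, -1, -1, 1, -1, 1, 1, -1, 1, -1, 1, 1, -1]
min_sf_codes = [
    [c if j * 4 <= n < (j + 1) * 4 else 0 for n, c in enumerate(symbol)]
    for j in range(4)
]
# During a symbol period with SF = 8 there are J_k[m] = 2 virtual users.
sf8_codes = combine_codes(min_sf_codes, j_m=2)
```

Because the minimum-SF codes occupy disjoint chip ranges, each combined code is simply the concatenation of C_k[m] adjacent segments.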

With this result we are now able to represent the MUD signal model in terms of the C- 
matrix and R-matrix elements based on the codes corresponding to the minimum DCH 
spreading factors. The C-matrix in Equation () above becomes 

m 2Nf[m]r s 

(i+i)C,|mH U+0Ql"H 

= — 1 YY$[(r - s)N c + (m - n)T + f lq - r kp ] ^[r] • £ c */ [s] 

2N f [m] , : .W-C,[m] >=> C t M 

N,[m] r =ic,\m] r=jc t \n] 2N ; r , 

^V/l^J i'=iC,[m) r=jC t [n) 



c % [«-«]4^ ?l(r " sR+(ffl "" )r+f, '" J, ' lc;ilr! ' c, ' [l1 
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where N, = min N{m]= SF,. Similarly, the R-matrix becomes 

q=l p=l 



N (i+OQImH (i+iK»(«H t t f i 

= T^h I I IlRefe^C,,-.[m-,]} 

/V/L^J r=iC,\m) j'=jC t [n) ?=1 p=l 
N (i+\)C,[m}~\ (i+l)C t [nM 

= T^"T I I *vl— "1 (22> 

/V,|mJ /V/Clm] 



9 =1 /?=! 

so that the matched-filter outputs become 



n *=l j=0 (23) 
Ad AI«H f N V+iyCAmY-i (j+l)Al»H 1 

=x£z U71 i I r 1(V [«-»]kw+ii.[«] 

» Si >=o l^/L m J /w-Qim] j--K,W J 



j=0 i 

This last equation can be written 



„ h y=o |W/D«] wc,w ;=K*M J 



(i+\)C,[m) 



y«M s XX £ I r w [ifi-n] 

n k=\ j=0 [ j'=jC t [n) 



b kj [n\ 



K n J t [n)-\ \(j+\>C k [n)-\ 

= IZI S r, ( ,,[m-n]b,.[n] 

n A=1 ;=0 [ r=jC k [n) 



n fc=1 /=0 



(24) 



where we have defined b_kj'[n] ≡ b_kj[n] for jC_k[n] <= j' < (j + 1)C_k[n]. Equation (24) is 
expressed entirely in terms of matrix elements corresponding to the minimum spreading 
factor for the MUD processing frame. 
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To: Wireless Communications Group 
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Subject: WCDMA Downlink MUD Date: February 23, 2001 



1. Introduction 

Multiuser Detection (MUD) is most often thought of as a technique to improve either 
capacity or coverage for the uplink. A few reasons why MUD is uplink-focussed are 

• Downlink MUD must be performed in the handsets, which are limited in processing 
power 

• Each handset is interested in only one signal 

• In the downlink users are separated by orthogonal codes 

However there is typically a greater demand for capacity in the downlink. If MUD is only 
applied in the uplink the imbalance is even greater. While in the downlink users are 
separated by orthogonal codes, because of multipath there is still significant intra-cell 
interference. Equalization has been suggested as a means of restoring orthogonality; 
however, the computationally attractive linear equalization methods tend to amplify the 
other-cell interference and noise. 

A downlink MUD method with reduced complexity is described in the next section. 
The Fast Hadamard Transform (FHT) is used to reduce complexity. The FHT is used in 
both the forward (demodulation) and backward (regeneration) directions. 

2. The Method 

The method proceeds according to the following steps 

• Receive amplitude and delay information from the searcher receiver 

• Start with the largest multipath 

• Multiply the received signal by the conjugate of the scrambling code (512 chips at a 
time) 

• Perform the FHT on the result (for multirate users, this is done in stages) 

• Determine soft data estimates 

• Set user-of-interest data symbols to zero 

• Do same for all multipaths 

• Proceed till end of slot 

• Estimate amplitudes and gain factors 

• Diversity combine results and make hard decisions 

• Use hard decisions, gain estimates and FHT to reconstruct chip sequence c[n] (with 
user of interest nulled) 

• Multiply c[n] by c_sh[n] to form d[n] (with user of interest nulled) 

• Use amplitude estimates, delay lag estimates (from searcher) and raised-cosine pulse 
to construct chip filter 

• Pass d[n] (with user of interest nulled) through chip filter to reconstruct interference 
signal 

• Subtract interference signal from received signal 

• Demodulate with conventional RAKE receiver 
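The core of these steps (descramble, FHT, null the user of interest, regenerate with the FHT, rescramble, subtract) can be sketched in Python under strong simplifying assumptions: a single path, chip-rate samples, no noise, a real ±1 scrambling sequence (so its conjugate is itself), all users at one spreading factor, and soft estimates used directly for regeneration instead of hard decisions with separately estimated gains. The function names and the 8-chip example are hypothetical.

```python
def fht(x):
    """Fast Hadamard transform in natural (Sylvester) order; len(x) must be 2**k."""
    y = list(x)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:                          # log2(n) butterfly stages
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y


def cancel_interference(r, scramble, user):
    """Remove every code channel except `user` from one block of received chips."""
    n = len(r)
    chips = [x * s for x, s in zip(r, scramble)]    # descramble
    soft = [v / n for v in fht(chips)]              # forward FHT: one estimate per code
    soft[user] = 0                                  # null the user of interest
    regenerated = fht(soft)                         # backward FHT: interference chips
    interference = [c * s for c, s in zip(regenerated, scramble)]
    return [x - i for x, i in zip(r, interference)]


# Demo: 8 code channels with amplitude-times-bit values; user of interest is code 2.
amp_bits = [0, 2, -1, 0, 0.5, 0, 0, -3]
scramble = [1, -1, -1, 1, 1, 1, -1, 1]
composite = fht(amp_bits)                           # c[n] = sum_k A_k * w_k[n]
received = [c * s for c, s in zip(composite, scramble)]
cleaned = cancel_interference(received, scramble, user=2)
```

The same routine serves as forward and backward transform because the Sylvester-ordered Hadamard matrix is symmetric and satisfies H·H = n·I. A 512-point transform takes nine butterfly stages in this loop.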

The WCDMA transmitted signal can be represented as 

\[ s[t] = \sum_n g[t - nN_c]\, d[n] \]

\[ d[n] = \Big\{ \sum_{k=1}^{K} G_k\, b_k[n \,\mathrm{div}\, N_k]\, c_{ch,k}[n] \Big\} \cdot c_{sh}[n] = c[n]\, c_{sh}[n] \]

\[ c[n] \equiv \sum_{k=1}^{K} G_k\, b_k[n \,\mathrm{div}\, N_k]\, c_{ch,k}[n] \]

where g[t] is the raised-cosine pulse (see note 1), N_c is the number of samples per chip, and d[n] is 
the composite chip sequence from all users. The received signal is then 

\[ r[t] = \sum_{q=1}^{L} a_q\, s[t - \tau_q] = \sum_{q=1}^{L} \sum_n a_q\, g[t - \tau_q - nN_c]\, d[n] \]

The received signal advanced to the delay of interest is 

\[ r[nN_c + \tau_p] = \sum_{q=1}^{L} a_q\, s[nN_c + \tau_p - \tau_q] \]

The received signal multiplied by the conjugate of the scrambling code is 

\[ r[nN_c + \tau_p]\, c^*_{sh}[n] = \sum_{q=1}^{L} \sum_m a_q\, g[\tau_p - \tau_q - mN_c]\, c[m+n]\, c_{sh}[m+n] \cdot c^*_{sh}[n] = \tilde{a}_p\, c[n] + v[n] \]

Note 1: The chip-matched filter is artificially placed in the transmitter for simplicity of presentation. 



This result can now be demultiplexed using the 512 x 512 FHT. Since 512 = 2^9, the FHT 
proceeds in 9 stages. After the first two stages the SF 4 symbols can be extracted. 
Similarly, after k stages the SF 2^k symbols can be extracted. The amplitudes a_p can be 
determined from the embedded pilot symbols, or searcher-receiver estimates can be 
used. If embedded pilot symbols are used the measurement M_pk of the pth multipath of 
the kth user is of the form 

\[ M_{pk} = a_p\, G_k \]

which includes the user gain factor. After measurements are taken for all multipaths and 
all users for a given slot, the multipath amplitudes and user gains can be separated by 
determining the dominant left and right singular vectors of the rank-1 matrix M_pk (aside 
from an arbitrary scale factor which can be given to either the amplitudes or the gains). Once 
the approximate amplitudes ã_p are known the actual amplitudes a_p are determined by 
inverting the diagonally dominant system of equations 

\[ \tilde{a}_p = \sum_{q=1}^{L} a_q\, g[\tau_p - \tau_q] = \sum_{q=1}^{L} g_{pq}\, a_q, \qquad g_{pq} \equiv g[\tau_p - \tau_q] \]
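The separation of amplitudes and gains from a rank-1 measurement matrix can be illustrated with a power iteration for the dominant singular vectors. This Python sketch is a simplification: it uses real-valued entries (complex amplitudes would need conjugates in the transpose step), assumes the all-ones starting vector is not orthogonal to the gain vector (true when gains are positive), and uses hypothetical function and variable names.

```python
import math


def rank1_factors(m, iterations=50):
    """Dominant singular triple (u, sigma, v) of a real matrix by power iteration."""
    rows, cols = len(m), len(m[0])
    v = [1.0] * cols                      # starting guess for the right vector
    sigma = 0.0
    for _ in range(iterations):
        u = [sum(m[p][k] * v[k] for k in range(cols)) for p in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(m[p][k] * u[p] for p in range(rows)) for k in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return u, sigma, v


# Demo: measurements M_pk = a_p * G_k for three multipaths and three users.
amplitudes = [1.0, -0.5, 0.25]
gains = [2.0, 1.0, 4.0]
measurements = [[a * g for g in gains] for a in amplitudes]
u, sigma, v = rank1_factors(measurements)
rebuilt = [[sigma * up * vk for vk in v] for up in u]
```

The vectors u and v are proportional to the amplitudes and gains respectively. The scale sigma can be assigned to either factor, which is the arbitrary scale factor mentioned in the text.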

The chip filter h[t] for reconstructing the interference signal is 
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H.A. Bootstrap Functional Design Specification 

1 Purpose 

The purpose of this memo is to document parts of the discussion we have been 
having on how the TI 6414 DSP may connect to the Raceway. 

2 Glossary 

EMIF - A port on the DSP 6000 series peripheral bus which allows the 
connection of memory devices. 

SDRAM - In the context of this memo, means the main external memory of the 
TI DSP - the one which contains the program and data. 

3 Overview 

So far, a proposed architecture is that we use the second EMIF (External 
Memory Inter-Face) of the TI 6414 DSP to connect to a dual ported RAM. 
Raceway transfers actually access the RAM, and then additional processing takes 
place on the DSP to move the data to the correct place in SDRAM. In fact, if the 
dual port RAM is not large enough to buffer an entire Raceway transfer, then there 
will have to be a messaging protocol between the two endpoint DSPs wishing to 
exchange messages (because the message will have to be fragmented in order to 
not exceed the reserved buffer space). 

An additional restriction of this design is that as more Raceway endpoints are 
added, the size of the dual port RAM needs to be increased, or the maximum 
fragment size needs to shrink, such that the RAM is big enough to contain at least 
2*N*P buffers of size F (2*F*N*P bytes in total), where F is the size of the fragment, 
N is the number of Raceway endpoints with which this DSP can exchange messages, 
P is the number of parallel transfers which can be active on any endpoint at a time, 
and the constant 2 represents double buffering so that one buffer can be transferred 
to/from the Raceway, while a second buffer can be transferred to the DSP. The 
constant becomes 4 if you want to be able to emulate a full duplex connection. 
With a 4 node system, this might be 4*8K*4*4 or 512K plus a little extra for 
bookkeeping information. This probably means the minimum size is 1M bytes for 
the dual port device. 
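The sizing rule can be checked with a few lines of Python. The function and parameter names are illustrative; the formula and the 8K, 4-node, 4-transfer example come from the paragraph above.

```python
def dual_port_bytes(fragment_bytes, endpoints, parallel_transfers, full_duplex=False):
    """Buffer space needed: 2*F*N*P for double buffering, 4*F*N*P for full duplex."""
    buffering = 4 if full_duplex else 2
    return buffering * fragment_bytes * endpoints * parallel_transfers


# 8K fragments, 4 endpoints, 4 parallel transfers each, full duplex emulation.
size = dual_port_bytes(8 * 1024, endpoints=4, parallel_transfers=4, full_duplex=True)
size_kbytes = size // 1024
```

The result is 512K bytes before bookkeeping overhead, so a 1M byte part is the next practical size up.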

4 Problem Identification 

There are several characteristics of this architecture which could prove 
problematic: 

4.1 Requirement For A Fragment/Defragment Protocol 

Raceway transfers can currently be very long. This architecture would require 
a protocol for breaking transfers down into fragments. If the DSP is sourcing a 
transfer greater than the fragment size, then it has to either dedicate itself for the 
period of the transfer to programming the DMA engine, or it has to respond to 
interrupts as each fragment is transferred. In either case, there is a substantial 
performance impact above and beyond the normal performance hit due to 
memory bandwidth utilization. 

If the DSP is on the receiving end of a Raceway transfer, a similar process has 
to take place, except that there must be an interrupt to get the attention of the DSP 
(polling would not be sufficient in such a case). 

Beyond the performance hit such a protocol would impose on the DSP, there is 
a major disadvantage in that only endpoints willing to implement this protocol can 
exchange data with the DSP. It is, in effect, defining a de facto standard subset of 
Raceway. This is a major interoperability issue (you can no longer plug a board of 
DSPs into a fabric and have them work as a standard Raceway Adjunct 
Processor). 

4.2 Requirement For The DSP To Be Running Code 

If the DSP is involved in the Raceway transfers, then the DSP must already be 
running in order to perform Raceway transfers. This will require that all nodes on 
the Raceway be self booting. 

4.3 Lower Transfer Rates 

Raceway is less efficient with smaller transfer sizes. If the fragment size is kept 
small to minimize dual port RAM requirements, then aggregate Raceway transfer 
rates will be lower because of less efficient utilization of the fabric. 

4.4 It Is Different 

By changing the way Raceway works, we initiate a significant departure from 
the way all current Mercury systems work. While there are many other possible 
architectures which will perform well, it is inherently risky to change a 
fundamental model of how our multiprocessors communicate. 

5 An Alternative Architecture 

It may be possible to implement a different architecture which addresses some 
of these shortcomings. 

5.1 Architecture Description 

The proposed architecture still has approximately the same hardware as the 
existing architecture. The changes are in the way that the Raceway transfers move 
between SDRAM and the Raceway. 

In the proposed architecture, the FPGA connects to both the buffering device 
(dual port RAM or FIFO) and the DSP. The connection to the buffering device 
(hereafter FIFO) is used to move Raceway data to/from the FIFO. 

The second connection is to the DSP Host Port. Dave currently believes this is 
a moderately high performance interconnect - on the order of 75 Mbytes per 
second. This interconnect could itself be used to move data to/from the DSP. The 
host port can access data in the DSP on-chip memory, as well as any of the 
peripheral devices, including the SDRAM. However, 75 Mbytes per second is 
pretty slow compared to normal Raceway bandwidth, and we think we can do 
better. 

The 6414 contains a second EMIF which can be attached to the FIFO (this is 
similar to what the current architecture proposal intends). The difference in this 
proposed architecture is that rather than have the DSP program the DMA engine 
to move data between the FIFO and the DSP/SDRAM, we propose that the FPGA 
program the DMA engine directly via the Host Port. 

The Host Port is a peripheral like the EMIF and the Serial Ports. The difference 
is that the Host Port can master transfers into the DSP datapaths, i.e. it can read 
and write any location in the DSP. Because the Host Port can access the DMA 
Controller (we think), it can be used to initiate transfers via the DMA engine. 

The advantage of this architecture is that Raceway transfers can be initiated 
without the cooperation of the DSP. Thus, the DSP does not have to be self 
booting. Performance is increased in two ways: the DSP is free to continue to 
compute while Raceway transfers take place, and performance on the Raceway is 
increased because there is no need to fragment messages. 

The internal datapaths of the DSP are flexible enough that we can control 
which devices have priority access to memory and datapath. Specifically, we can 
choose to give Raceway transfers priority over the CPU, or vice versa. 

5.2 Synchronization Issues 

There is an issue to be solved in how we match data rates between Raceway 
and the DSP. The EMIF looks to the DSP as if it were a memory, thus it is 
reasonable for the DSP to assume it can get at the data it needs at any time. 
However, if we indeed use a FIFO to buffer data, the implication is that there is a 
way to hold off the DSP when we are waiting for the Raceway to empty or fill our 
FIFO. A possibility is that the buffer device remains a dual port RAM rather than 
a FIFO, and the FPGA actually does a fragment/defragment into the RAM, and 
then programs the DMA engine to move that fragment into/out of the DSP. This 
starts to look somewhat like the original architecture, except that because the 
FPGA performs the frag/defrag, the actual transfers over the Raceway can be 
arbitrarily sized (assuming we can throttle the Raceway). 

Synchronization remains one of the larger problems to be solved with this 
proposed architecture. 

146 5.3 Sample Transfers 

147 In order to illustrate how this architecture would work, two examples are 

148 given. The first example is when the Raceway attempts to read data out of the 

149 DSP memory. 

150 5.3.1 Raceway Reading DSP Memory 

151 In this example, we assume that another DSP is trying to read the SDRAM of 

152 the local DSP. 

153 1) The FPGA detects a Raceway packet arriving, and decodes that it is a read

154 of address 0x10000 (for instance).

155 2) The FPGA writes over the Host Port Interface in order to program the 

156 DMA engine. It programs the DMA engine to transfer data starting at 

157 location 0x10000 (a location in the primary EMIF corresponding to a 

158 location in SDRAM) to a location in the secondary EMIF (the buffer 
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159 device/FIFO). As data arrives in the buffer device, the FPGA reads the 

160 data out of the buffer device, and moves it onto the Raceway. When the 

161 proper number of bytes have been moved, the DMA engine finishes the 

162 transfer, and the FPGA finishes moving data from the FIFO to the 

163 Raceway. 

164 5.3.2 Raceway Writing DSP Memory 

165 In this example, we assume that another DSP is trying to write to the SDRAM 

166 of the local DSP. 

167 1) The FPGA detects a Raceway packet arriving, and decodes that it is a 

168 write of location 0x20000 (for instance).

169 2) The FPGA fills some amount of the buffer device with data from the 

170 Raceway, and then:

171 3) The FPGA writes over the Host Port Interface in order to program the 

172 DMA engine. It programs the DMA engine to transfer data from the buffer 

173 device (secondary EMIF) and to write it to the primary EMIF at address 

174 0x20000. 

175 4) At the end of the transfer, we could either interrupt the DSP to signal that 

176 a Raceway packet has arrived, or we can use the standard Mercury method 

177 of polling a location in the SMB to see whether the transfer has completed 

178 yet. 
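
As a rough illustration of the write sequence in 5.3.2, the sketch below models steps 1) through 4) in Python. Every class and method name here (Dsp, Fpga, hpi_program_dma, on_raceway_write) is hypothetical and stands in for hardware behavior; none of it is a TI or Mercury API, and it assumes the Host Port can in fact program the DMA engine, which still needs to be verified.

```python
# Hypothetical model of the proposed Raceway write path (section 5.3.2).
# These classes only mimic the ordering of events, not real hardware registers.

class Dsp:
    """Toy DSP: a flat memory plus a DMA engine reachable over the Host Port."""

    def __init__(self):
        self.memory = {}           # address -> byte; stands in for SDRAM on the primary EMIF
        self.interrupted = False   # set when the DMA transfer completes

    def hpi_program_dma(self, buffer, dest_addr, count):
        # The FPGA, not the DSP CPU, issues this request through the Host Port.
        for offset in range(count):
            self.memory[dest_addr + offset] = buffer.pop(0)
        self.interrupted = True    # step 4, interrupt variant (polling is the alternative)


class Fpga:
    def __init__(self, dsp):
        self.dsp = dsp
        self.buffer = []           # buffer device on the secondary EMIF

    def on_raceway_write(self, dest_addr, payload):
        # Step 1: packet decoded as a write to dest_addr (decoding not modeled).
        self.buffer.extend(payload)                                     # step 2
        self.dsp.hpi_program_dma(self.buffer, dest_addr, len(payload))  # step 3


dsp = Dsp()
fpga = Fpga(dsp)
fpga.on_raceway_write(0x20000, [0xDE, 0xAD, 0xBE, 0xEF])
print([hex(dsp.memory[0x20000 + i]) for i in range(4)], dsp.interrupted)
```

The DSP CPU never appears in the call chain, which is the point of the proposal: the transfer is set up and completed without its cooperation.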

179 5.4 Additional Thoughts 

180 1) We need to verify that the Host Port Interface can program the DMA 

181 engine. The documentation on the 6201 clearly states that it can write to 

182 any location in internal memory, and to anywhere on the peripheral bus, 

183 however the DMA engine/controller is the datapath controller for all that, 

184 so it is always possible that there is a special case which does not allow 

185 writing of the DMA engine/controller registers from HPI. The chance of

186 this being so is quite remote, but needs to be verified.

187 2) We need to understand the transfer rates and latencies of the HPI. This

188 architecture relies on fairly low latency access through the HPI, otherwise

189 more buffering space would be required, and at some point bandwidth 

190 begins to be affected. 

191 3) We need to understand the limitations of Raceway with respect to 

192 throttling, etc. The best case would be that Raceway can provide data as 

193 fast as the EMIF can take it (so we wouldn't worry about having data 

194 ready when EMIF wanted it), and also for Raceway to be able to be 

195 throttled so that it can take the data at the rate the EMIF can provide it. 

196 The more the reality deviates from this best case scenario, the more extra

197 logic is required in the FPGA until at some point complexity may prevent 

198 the architecture from being viable.

199 4) What we currently know about the 6414 is actually educated guesses

200 based on documentation of earlier DSPs. We are making some 

201 assumptions about how TI will have enhanced their chip.

202 5) If/when TI ever puts a RapidIO interface on their DSPs, it will almost 

203 certainly look like a high speed HPI, i.e. it will sit on the peripheral bus, 

204 have a separate datapath channel, data coming in will simply flow to the 

205 correct addresses, and outgoing data transfers will happen by 

206 programming the DMA engine to send data to the RapidIO peripheral 

207 address. This proposed architecture looks almost exactly like that, and so 

208 probably will not require major changes to use a RapidIO enhanced DSP. 

209 6) There are probably more thoughts, but this is probably a good start...

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6201 Design Options 
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Option 1 is the original proposal submitted at the DSP meeting Monday. Option 2 was created during the 
meeting. 

The main shortfall in Option 1 is the sharing of the EMIF bus between the 6201 and the Raceway
DMA FPGA. During DMA operations over the Raceway, the 6201 will not have access to the EMIF
interface. Any data or instruction fetches from SDRAM will stall. Given the relatively small size of the
internal SRAM, this will impose a significant penalty to the operation of the 6201. Option 1 also requires
the FPGA to take over SDRAM refresh operation when it takes control of the EMIF bus. This passing back
and forth of the refresh task will not be clean.

Option 2 places a bi-directional transceiver between the 6201's EMIF bus and the Raceway
SDRAM. This allows the 6201 to process data and fetch instructions without any interruption from its
local SDRAM while the DMA FPGA is accessing the Raceway SDRAM. The HPI interface is used by the
6201 to program the DMA engine and by the DMA engine to indicate the DMA complete status to the
FPGA. Option 2 also lends itself to a dual 6201 node per Raceway interface. Decode logic controlling
access to the Raceway SDRAM can be designed in a number/combination of ways:

Total access to both 6201s

Separate areas for each 6201

Read but no write to the other 6201's memory space

A separate common area accessible to both for message passing 

The ability of one 6201 to go through the transceiver to the other's local SDRAM (not recommended)

For a migration story to the 6414, Option 2 is a better sell. Option 3 shows the 6414 design: the transceiver is
stripped off and the Raceway SDRAM is connected to the second EMIF. The design will go to one DSP per
Raceway due to the increase in processing power of the 6414.

1. Introduction

Typical processing: 

Signal is sampled at N samples per chip. 
Despread by 

upsampling chipping sequence by interpolating and using the RRC chip pulse matched filter as an 
interpolation filter 

Multiplying digitized receive signal by upsampled and interpolated chip sequence 
Accumulate (integrate) results for an entire DPCCH symbol. 

Repeat at the early lead and late lag sample offset values to calculate delay locked loop variables 
Sweep the code correlator N*256 lags to determine code synchronization and channel response 

Spreading sequence is 256 chips long 

Typical filter is 12 chips long 

typical oversampling rate on the receiver is N=8 

Key calculations 

Interpolation of the spreading code - precomputed and stored 

Correlation process: N*256 CMAC 

Correlation repeated for N*256 + 2 (DLL) times 

Total CMACS: N*256 * (N*256 + 2) = N^2*65536 + 512*N

For N = 8, this results in: 4,198,400 CMAC 

1 CMAC = 4RMUL + 2RADD = 6ROP 

Results in 25,190,400 Real operations 

At 15000 Hz symbol rate, need: 378 GOP/s 
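
The arithmetic above can be checked with a few lines of Python. This is only a restatement of the figures in the notes (N = 8, 256 chips, 6 real operations per CMAC, 15000 Hz symbol rate); the variable names are illustrative.

```python
# Operation count for the conventional sliding correlator described above.
N = 8                      # oversampling rate (samples per chip)
CHIPS = 256                # spreading sequence length
SYMBOL_RATE_HZ = 15000
ROP_PER_CMAC = 4 + 2       # 4 real multiplies + 2 real adds

lags = N * CHIPS                       # CMACs per correlation
cmacs = lags * (lags + 2)              # repeated for every lag plus 2 DLL offsets
rops = cmacs * ROP_PER_CMAC            # real operations per symbol
gops = rops * SYMBOL_RATE_HZ / 1e9     # real operations per second, in GOP/s

print(cmacs, rops, round(gops))
```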




2. A New Design 

Use of FFT to perform efficient circular convolution of spreading code sequence 
Results in 

Short code synchronization ( chip sync only, not slot or frame ) 

DPCCH demodulation 

Early and late Delay Locked Loop variables 

Rough channel estimate values for an entire symbol worth of differential delay 
Polyphase signal processing 

Digitize the signal at an Nx oversample rate, filter with the RRC filter, and split into N streams at
the 1x rate.

Compute the complex conjugate of the FFT of the spreading code sequence at the chip rate;
precomputed and stored

Computation: 

Filter data at Nx oversample rate and split into N streams at 1x rate



For each stream, 

Compute 256 point FFT 

Complex multiply FFT with stored FFT values of spreading code 
Inverse 256-point FFT 
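
A minimal sketch of the per-stream step, assuming a plain recursive radix-2 FFT in place of the radix-4 transform costed below, a random +/-1 code, and an arbitrary 37-chip delay. None of these choices come from the notes; they only demonstrate that FFT, multiply by the stored conjugate code spectrum, and inverse FFT yields the correlation at all 256 lags at once.

```python
import cmath
import random

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def correlate_fft(rx, code):
    """Circular correlation of rx against code at every lag, via the FFT."""
    n = len(code)
    code_spectrum_conj = [v.conjugate() for v in fft(code)]  # precomputed and stored
    product = [a * b for a, b in zip(fft(rx), code_spectrum_conj)]
    return [v / n for v in fft(product, inverse=True)]

def correlate_direct(rx, code, lag):
    """Reference value for one lag, computed the slow way."""
    n = len(code)
    return sum(rx[(i + lag) % n] * code[i].conjugate() for i in range(n))

random.seed(1)
code = [complex(random.choice((-1, 1)), 0) for _ in range(256)]
delay = 37                                            # illustrative chip offset
rx = [code[(i - delay) % 256] for i in range(256)]    # noiseless delayed copy
corr = correlate_fft(rx, code)
peak = max(range(256), key=lambda lag: abs(corr[lag]))
print(peak)
```

The peak index recovers the chip offset, which is why one pass per stream gives short-code synchronization, the early/late DLL values, and a rough channel estimate together.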

Ops calculation: 

Input filter: could be done using FFT as well. 

but for time domain processing: 8*256 points, filter length 96 => 

96 RMUL per point, 95 RADD per point

Total of 196,608 RMUL, 194,560 RADD per symbol => 391,168 ROP per symbol

I and Q streams => 782,336 ROP



Stream processing ( 8 streams ) 

Radix 4 FFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP 
256 CMUL =1536 ROP 

Radix 4 IFFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP 
TOTAL per stream: 71,168 ROP

Total stream calcs: 569,344 ROP 

Total ops per second at 15000 Hz symbol rate is: 20.3 GOPS
more than 18 times more efficient than traditional approach.

Also, the DLL circuitry can be eliminated since the entire channel response is calculated at the 
symbol rate. 

FFT numbers may be off by a factor of 2 larger in the number of complex multiplications needed
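
The totals above can be reproduced as follows. The per-transform figure of 34,816 ROP is taken as given from the notes rather than re-derived, and 6 ROP per complex multiply follows the CMAC convention used earlier; the variable names are illustrative.

```python
# Operation count for the FFT-based design, using the figures quoted above.
N_STREAMS = 8
FFT_LEN = 256
FILTER_LEN = 96
SYMBOL_RATE_HZ = 15000
FFT_ROP = 34816            # per 256-point radix-4 FFT, as quoted in the notes
ROP_PER_CMUL = 6

points = N_STREAMS * FFT_LEN                                  # samples per symbol
filter_rop = points * FILTER_LEN + points * (FILTER_LEN - 1)  # RMUL + RADD
filter_rop_iq = 2 * filter_rop                                # I and Q streams

per_stream = FFT_ROP + FFT_LEN * ROP_PER_CMUL + FFT_ROP       # FFT, multiply, IFFT
stream_total = N_STREAMS * per_stream

total_gops = (filter_rop_iq + stream_total) * SYMBOL_RATE_HZ / 1e9
traditional_gops = 25190400 * SYMBOL_RATE_HZ / 1e9            # from the Introduction
print(filter_rop_iq, per_stream, stream_total, round(total_gops, 1),
      round(traditional_gops / total_gops, 1))
```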
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Introduction 

Multi-User Detection (MUD) has been shown to provide a number of significant benefits [1][2]. These
include increased system capacity, increased range, enhanced Quality of Service (QoS), improved near-far 
resistance, extended battery life, and reduced handset transmit power. This paper describes the practical 
implementation of Multi-User Detection (MUD) for the UMTS uplink using short codes. The focus is on 
practical implementation details such as efficient implementation of the calculations, processing 
requirements, latencies, MUD efficiency, and mapping to hardware. 

The use of short codes allows MUD to be performed at the symbol rate. As such MUD can be introduced 
into a conventional Base-Transceiver-Station (BTS) as an enhancement to the Matched-Filter (MF) RAKE 
receiver. The MUD processing takes the MF detection statistics, performs interference cancellation, and 
then delivers improved hard or soft-decision symbol estimates to the symbol-rate BTS processing
functions. The MUD processing introduces only a few milliseconds of latency. Because of the reduced
computational complexity of MUD operating at the symbol rate, the entire MUD functionality can be
implemented in software on a single card or daughter card populated with a minimal number of processors.
We present here an implementation of an iterative hard-decision Interference Cancellation (IC) algorithm
on four Power PC 7410 processors. The processors are connected together with a high-bandwidth RACE++ 
interconnect fabric. 

In order to perform MUD at the symbol rate the correlation between the user channel-corrupted signature 
waveforms must be calculated. These correlations are stored as elements of matrices, here referred to as the 
R-matrices. Since the channel is continually changing these correlations must be updated in real time. 
There are two elements to updating the R-matrices. The first part is based on the user code correlations. 
These depend on the relative lag between the various user multipath components. It is assumed that these 
lags change with a time constant of about 400 ms. The second part is due to the fast variation of the 
Rayleigh-fading multipath amplitudes. It is assumed that these amplitudes are changing with a time 
constant of about 1.33 ms. The R-matrices are used to cancel the multiple access interference through the 
Multi-stage Decision-Feedback Interference Cancellation (MDFIC) technique.



UMTS Uplink Multi-rate Signal Model and RAKE Processing 

We derive here the equations describing the MF outputs based on the WCDMA transmitted waveform.
The users accessing the system will hereafter be referred to as physical users. Each physical user is 
regarded as a composition of virtual users. Each virtual user transmits a single bit per symbol period, where 
by symbol period we mean a time duration of 256 chips (i.e. 1/15 ms). The number of virtual users, then, 
for a given physical user is equal to the number of bits transmitted in a symbol period. At a minimum each 
active physical user is composed of two virtual users, one for the Dedicated Physical Control Channel 
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(DPCCH)[3] and one for the Dedicated Physical Data CHannel (DPDCH). If the physical user is a data 
user with Spreading Factor (SF) less than 256 then there are J = 256/SF data bits and one control bit
transmitted per symbol period. Hence for the rth physical user with data-channel spreading factor $SF_r$, there
are a total of $1 + 256/SF_r$ virtual users. The total number of virtual users is denoted

$$K_v = \sum_{r=1}^{K} \left[ 1 + \frac{256}{SF_r} \right] \qquad (1)$$
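
As a small worked example of the virtual-user count, the following Python sketch sums one control bit plus 256/SF data bits per physical user. The function name and the example user mixes are illustrative assumptions.

```python
def virtual_users(spreading_factors):
    """Total virtual users: one control bit plus 256/SF data bits per physical user."""
    return sum(1 + 256 // sf for sf in spreading_factors)

voice_only = virtual_users([256] * 100)   # 100 low-rate users
one_fast_user = virtual_users([4])        # a single SF = 4 data user
print(voice_only, one_fast_user)
```

A single high-rate user therefore loads the detector as heavily as dozens of voice users, which is why the complexity figures later depend on the mix of spreading factors.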

The transmitted waveform for the rth physical user can be written as

$$x_r[t] = \sum_{k=1}^{1+J_r} \beta_k \sum_{m} s_k[t - mT]\, b_k[m] \qquad (2)$$



where t is the integer time sample index, $T = N N_c$ is the data bit duration, N = 256 is the short-code length,
$N_c$ is the number of samples per chip, and where $\beta_k = \beta_c$ if the kth virtual user is a control channel and
$\beta_k = \beta_d$ if the kth virtual user is a data channel. The multipliers $\beta_c$ and $\beta_d$ are constants used to select the relative
amplitudes of the control and data channels. At least one of these constants must be equal to 1 for any given
symbol period m. The waveform $s_k[t]$ is referred to as the transmitted signature waveform for the kth
virtual user. This waveform is generated by passing the spread code sequence $c_k[n]$ through a root-raised-cosine
pulse shaping filter h[t]. If the kth virtual user corresponds to a data user with spreading factor less
than 256 then the code $c_k[n]$ still has length 256, but only $N_k$ of the 256 elements are non-zero, where $N_k$ is
the spreading factor for the kth virtual user. The non-zero values are extracted from the code
$C_{ch,256,64} \cdot S_{sh}[n]$ [3]. The W-CDMA standard actually allows for up to six DPDCHs to be multiplexed with a single
DPCCH. This functionality is not presently incorporated in the MUD algorithms described below.

The baseband received signal can be written

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_k[t - mT]\, b_k[m] + w[t] \qquad (3)$$

$$\tilde{s}_k[t] = \sum_{q=1}^{L} a_{kq}\, s_k[t - \tau_{kq}]$$

where w[t] is receiver noise, $\tilde{s}_k[t]$ is the channel-corrupted signature waveform for virtual user k, L is the
number of multipath components, and $a_{kq}$ are the complex multipath amplitudes. The amplitude ratios $\beta_k$
are incorporated into the amplitudes $a_{kq}$. Notice that if k and l are two virtual users corresponding to the
same physical user then, aside from scaling by $\beta_k$ and $\beta_l$, $a_{kq}$ and $a_{lq}$ are equal. This is due to the fact
that the signal waveforms of all virtual users corresponding to the same physical user pass through the same
channel. The waveform $s_k[t]$ is now the received signature waveform for the kth virtual user. This
waveform is identical to the transmitted signature waveform given in Equation (2) except that the
root-raised-cosine pulse h[t] is replaced with the raised-cosine pulse g[t].

Thus far the received signal has been match-filtered to the chip pulse. It must next be match-filtered by the
user code-sequence filter. The resulting detection statistic is denoted here as $y_k$, the matched-filter output
for the kth virtual user. Since there are $K_v$ codes, there are $K_v$ such detection statistics, which are collected
into a column vector y[m] for the mth symbol period. The matched-filter output $y_l[m]$ for the lth virtual
user can be written

where $\hat{a}_{lq}$ is the estimate of $a_{lq}$, $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, $N_l$ is the (non-zero) length of code $c_l[n]$, and
$\eta_l[m]$ is the match-filtered receiver noise. Substituting r[t] from Equation (3) above gives

The terms for $m' \neq 0$ result from asynchronous users.



MUD Algorithm and Functions 

A vast number of MUD algorithms have been proposed [1][2]. Many of these are too computationally
complex to be implemented with current technology. The linear-iterative class of MUD algorithms 
[4] [5] [6] are the least computationally complex. For this class of algorithms software implementation is 
feasible. The hard-decision variants of these algorithms also enjoy a significant performance advantage in 
that they do not tend to amplify other-cell interference. The down side is that performance degrades under 
high input BER. Since channel decoding reduces the BER by orders of magnitude, it is possible to be 
operating with raw channel BERs as high as 10%. A number of methods have been proposed to address this 
issue, including the null-zone detector [4], and partial interference cancellation [4][5][6]. We employ 
partial interference cancellation in conjunction with a new thresholding technique which reduces 
computational complexity. Our method provides excellent performance under high input BER. 

The implementation of MUD at the symbol rate can be divided into two functions. The first function is the 
calculation of the R-matrix elements. The second function is interference cancellation, which relies on 
knowledge of the R-matrix elements. The calculation of these elements and the computational complexity 
are described in the following section. Computational complexity is expressed in Giga Operations Per 
Second (GOPS). The subsequent section describes the MUD IC function. The method of interference 
cancellation employed is Multistage Decision Feedback IC (MDFIC) [2][7].



R-matrix 

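From Equation (5) above, the R-matrix calculations can be divided into three separate calculations, each
with an associated time constant for real-time operation, as follows

$$r_{lk}[m'] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{q'=1}^{L} a^*_{lq}\, a_{kq'}\, C_{lkqq'}[m'] \right\}$$

$$C_{lkqq'}[m'] = \sum_{m} g[mN_c + m'T + \tau_{lq} - \tau_{kq'}]\, \Gamma_{lk}[m] \qquad (6)$$

$$\Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n} c^*_l[n]\, c_k[n - m]$$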

where we have omitted the hats indicating parameter estimates. Hence we must calculate the R-matrices,
which depend on the C-matrices {$C_{lkqq'}[m']$}, which depend on the Γ-matrix ($\Gamma_{lk}[m]$). The Γ-matrix has the
slowest time constant. This matrix represents the user code correlations for all values of offset m. For the
case of 100 voice users the total memory requirement is 21 MB based on two bytes (real and imaginary
parts) per element. This matrix is updated only when new codes (new users) are added to the system. Hence
this is essentially a static matrix. The computational requirements are negligible. The most efficient method
of calculation depends on the non-zero length of the codes. For high data-rate users the non-zero length of
the codes is only 4 chips. For these codes a direct convolution is the most efficient method to
calculate the elements. For low data-rate users it is more efficient to calculate the elements using the
FFT to perform the convolutions in the frequency domain.

The C-matrix is calculated from the Γ-matrix. These elements must be calculated whenever a user's delay
lag changes. For now assume that on average each multipath component changes every 400 ms. The length
of the g[t] function is 48 samples. Since we are oversampling by 4, there are 12 multiply-accumulations
(real x complex) to be performed per element, or 48 operations per element. When there are 100 low-rate
users on the system (200 virtual users) and a single multipath lag (of 4) changes for one user a total of
$(1.5)(2) K_v L N_v$ elements must be calculated. The factor of 1.5 comes from the 3 C-matrices (m' = -1, 0, 1),
reduced by a factor of 2 due to a conjugate symmetry condition. The factor of 2 results because both rows
and columns must be updated. The factor $N_v$ is the number of virtual users per physical user, which for the
lowest rate users is $N_v = 2$. In total then this amounts to 230,400 operations per multipath component per
physical user. Assuming 100 physical users with 4 multipath components per user, each changing once per
400 ms gives 230 MOPS.
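
The C-matrix figures can be verified directly. The sketch below simply restates the numbers in the paragraph above; the variable names are illustrative.

```python
# C-matrix update cost for 100 low-rate users, following the paragraph above.
K_V = 200            # virtual users
L = 4                # multipath components per user
N_V = 2              # virtual users per physical user (lowest rate)
OPS_PER_ELEMENT = 48 # 12 real-by-complex multiply-accumulates
PHYSICAL_USERS = 100
LAG_CHANGE_PERIOD_S = 0.4

elements_per_change = 1.5 * 2 * K_V * L * N_V
ops_per_change = elements_per_change * OPS_PER_ELEMENT
changes_per_second = PHYSICAL_USERS * L / LAG_CHANGE_PERIOD_S
mops = ops_per_change * changes_per_second / 1e6
print(int(elements_per_change), int(ops_per_change), round(mops, 1))
```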

The R-matrices are calculated from the C-matrices. From Equation (6) above the R-matrix elements are

$$r_{lk}[m'] = \mathrm{Re}\left\{ a_l^H\, C_{lk}[m']\, a_k \right\} \qquad (7)$$

where $a_k$ are L x 1 vectors, and $C_{lk}[m']$ are L x L matrices. The rate at which these calculations must be
performed depends on the velocity of the users. The selected update rate is 1.33 ms. If the update rate is too
slow such that the estimated R-matrix values deviate significantly from the actual R-matrix values then
there is a degradation in the MUD efficiency. Figure 1 below shows the degradation in MUD efficiency
versus user velocity for an update rate of 1.33 ms, which corresponds to two WCDMA time slots. The plot
indicates that there is high MUD efficiency for users with velocity less than about 100 km/hr. The plot
indicates that the interference corresponding to fast users is not cancelled as effectively as the interference
due to slow users. For a system with a mix of fast and slow users the resulting MUD efficiency is an average
of the MUD efficiency for the various user velocities.

Figure 1. MUD efficiency versus user velocity in km/hr

From Equation (7) the R-matrix elements can be calculated in terms of an X-matrix
which represents amplitude-amplitude multiplies

$$r_{lk}[m'] = \mathrm{tr}\left[ C^R_{lk}[m']\, X^R_{lk} \right] - \mathrm{tr}\left[ C^I_{lk}[m']\, X^I_{lk} \right] \qquad (8)$$

$$C_{lk}[m'] = C^R_{lk}[m'] + j\, C^I_{lk}[m']$$

The advantage of this approach is that the X-matrix multiplies can be reused for all virtual users associated
with a physical user and for all m' (i.e. m' = 0, 1). Hence these calculations are negligible when amortized.
The remaining calculations can be expressed as a single real dot product of length $2L^2 = 32$. The
calculations are performed in 16-bit fixed-point math. The total operations is thus $1.5(4)(K_v L)^2 = 3.84$
Mops. The processing requirement is then 2.90 GOPS. The X-matrix multiplies when amortized amount to
an additional 0.7 GOPS. The total processing requirement is then 3.60 GOPS.
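
A quick check of the R-matrix load, assuming the update period is exactly two WCDMA slots (2/1500 s, about 1.33 ms). With that assumption the rate comes out near the 2.90 GOPS quoted above; the small difference is rounding. The variable names are illustrative.

```python
# R-matrix update cost for K_v = 200 virtual users and L = 4 multipaths.
K_V = 200
L = 4
UPDATE_PERIOD_S = 2 / 1500        # two WCDMA time slots, about 1.33 ms (assumed exact)
X_MATRIX_GOPS = 0.7               # amortized X-matrix cost quoted in the text

dot_product_length = 2 * L ** 2               # real and imaginary parts of an L x L product
ops_per_update = 1.5 * 4 * (K_V * L) ** 2     # 3 offsets halved by symmetry
r_gops = ops_per_update / UPDATE_PERIOD_S / 1e9
total_gops = r_gops + X_MATRIX_GOPS
print(dot_product_length, int(ops_per_update), round(r_gops, 2), round(total_gops, 2))
```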



MDFIC 

From Equation (5) above the matched-filter outputs are given by

$$y_l[m] = r_{ll}[0]\, b_l[m] + \sum_{k \neq l} r_{lk}[0]\, b_k[m] + \sum_{k} r_{lk}[-1]\, b_k[m+1] + \sum_{k} r_{lk}[1]\, b_k[m-1] + \eta_l[m] \qquad (9)$$

The first term represents the signal of interest. All the remaining terms represent Multiple Access
Interference (MAI) and noise. The MDFIC algorithm iteratively solves for the symbol estimates $\hat{b}_l[m]$
using

$$\hat{b}_l[m] = \mathrm{sign}\left\{ y_l[m] - \sum_{k \neq l} r_{lk}[0]\, \hat{b}_k[m] - \sum_{k} r_{lk}[-1]\, \hat{b}_k[m+1] - \sum_{k} r_{lk}[1]\, \hat{b}_k[m-1] \right\} \qquad (10)$$

with initial estimates given by hard decisions on the matched-filter detection statistics, $\hat{b}_l[m] = \mathrm{sign}\{y_l[m]\}$.
The MDFIC [7] technique is closely related to the SIC and PIC techniques. Notice that new estimates $\hat{b}_l[m]$
are immediately introduced back into the interference cancellation as they are calculated. Hence at any
given cancellation step the best available symbol estimates are used. This idea is analogous to the
Gauss-Seidel method for solving diagonally dominant linear systems.
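
The iteration can be sketched in Python as below. This is an assumed form of the update, not the authors' code: each estimate is the sign of the matched-filter output minus the interference regenerated from the current estimates, with R[m'] applied to symbol m - m' and symbols outside the block ignored. The two-user R-matrix and bit pattern are made up to show a near-far case.

```python
def sign(v):
    return 1 if v >= 0 else -1

def mdfic(y, R, iterations=2):
    """Hard-decision multistage decision-feedback IC (assumed update form).

    y[m][l] : matched-filter output for symbol m, virtual user l
    R[mp]   : K x K correlation matrix for symbol offset mp in (-1, 0, 1)
    """
    n_sym, n_users = len(y), len(y[0])
    b = [[sign(v) for v in row] for row in y]       # initial hard decisions
    for _ in range(iterations):
        for m in range(n_sym):
            for l in range(n_users):
                mai = 0.0
                for mp in (-1, 0, 1):
                    n = m - mp                      # symbol affected by offset mp
                    if not 0 <= n < n_sym:
                        continue                    # outside the block: ignored here
                    for k in range(n_users):
                        if mp == 0 and k == l:
                            continue                # skip the signal of interest
                        mai += R[mp][l][k] * b[n][k]
                b[m][l] = sign(y[m][l] - mai)       # reused at once (Gauss-Seidel style)
    return b

# Illustrative synchronous case: a weak user next to a strong, correlated user.
zeros = [[0.0, 0.0], [0.0, 0.0]]
R = {-1: zeros, 0: [[1.0, 1.5], [1.5, 4.0]], 1: zeros}
truth = [[1, -1], [-1, -1]]
y = [[sum(R[0][l][k] * bits[k] for k in range(2)) for l in range(2)] for bits in truth]
matched_filter = [[sign(v) for v in row] for row in y]
detected = mdfic(y, R)
print(matched_filter, detected)
```

In this toy case the matched filter gets the weak user's first bit wrong, and one pass of cancellation corrects it because the strong user's decision is already reliable.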

The above iteration is performed on a block of 20 symbols, for all users. The 20-symbol block size 
represents two WCDMA time slots. The R-matrices are assumed to be constant over this period. 
Performance is improved under high input BER if the sign detector in Equation (10) is replaced by the 
hyperbolic tangent detector [6]. This detector has a single slope parameter which is variable from iteration 
to iteration. 

The three R-matrices (R[-1], R[0] and R[1]) are each $K_v$ x $K_v$ in size. The total number of operations then is
$6K_v^2$ per iteration. The computational complexity of the MDFIC algorithm depends on the total number of
virtual users, which depends on the mix of users at the various spreading factors. For $K_v$ = 200 users (e.g.
100 low-rate users) this amounts to 240,000 operations. In the current implementation two iterations are
used, requiring a total of 480,000 operations. For real-time operation these operations must be performed in
1/15 ms. The total processing requirement is then 7.2 GOPS. Computational complexity is markedly
reduced if a threshold parameter is set such that IC is performed only for values $|y_l[m]|$ below the threshold.
The idea is that if $|y_l[m]|$ is large there is little doubt as to the sign of $b_l[m]$, and IC need not be performed.
The value of the threshold parameter is variable from stage to stage.
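
The MDFIC load follows from the figures above; the short check below restates them, with illustrative variable names.

```python
# MDFIC processing requirement for K_v = 200 virtual users.
K_V = 200
ITERATIONS = 2
SYMBOL_PERIOD_S = 1 / 15000          # 1/15 ms

ops_per_iteration = 6 * K_V ** 2     # three K_v x K_v matrices, multiply and add
ops_per_symbol = ITERATIONS * ops_per_iteration
gops = ops_per_symbol / SYMBOL_PERIOD_S / 1e9
print(ops_per_iteration, ops_per_symbol, round(gops, 1))
```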

Mapping to Hardware 

The above calculations are performed on a single 9"x6" card populated with four Power PC 7410
processors. These processors employ the AltiVec SIMD vector arithmetic-logic unit, which has 32 128-bit
vector registers. These registers can hold either 4 32-bit floats, 4 32-bit ints, 8 16-bit shorts, or 16 8-bit
chars. Two vector SIMD operations (multiply and accumulate) can be performed per clock. The clock rate
used for the current implementation is 400 MHz. The processors, however, can be operated at 500 MHz,
with higher clock speeds expected in the near future. Each processor has 32KB of L1 cache and 2MB of 266MHz L2
cache. The maximum theoretical performance of these processors is thus 3.2 GFLOPS, 6.4 GOPS (16-bit),
or 12.8 GOPS (8-bit). The current implementation uses a combination of floating-point, 16-bit fixed-point
and 8-bit fixed-point calculations.

The four PPC7410 processors are interconnected with a RACE++ 266MB/s 8-port switched fabric as 
shown in Figure 2. The high bandwidth fabric allows transfer of large amounts of data with very low 
latency so as to achieve efficient parallelism of the four processors. The maximum theoretical performance 
of the card is thus 51.2 GOPS.
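
The peak figures quoted for the processors and the card follow from the clock rate, two vector operations per clock, and the number of elements per 128-bit register. A short check, with illustrative names:

```python
# Peak throughput of one 400 MHz processor and of the four-processor card.
CLOCK_HZ = 400e6
VECTOR_OPS_PER_CLOCK = 2             # multiply and accumulate
PROCESSORS = 4
LANES = {"32-bit float": 4, "16-bit": 8, "8-bit": 16}   # elements per 128-bit register

peak_gops = {name: CLOCK_HZ * VECTOR_OPS_PER_CLOCK * lanes / 1e9
             for name, lanes in LANES.items()}
card_gops = PROCESSORS * peak_gops["8-bit"]
print(peak_gops, card_gops)
```

The 51.2 GOPS card figure is therefore the 8-bit peak; work done in 16-bit or floating point has a proportionally lower ceiling.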
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Figure 2. Partitioning of MUD functions across four processors 

As shown in Figure 2 the MDFIC and C-matrix calculations are allocated to a single processor. The other
three processors are given to the R-matrix calculations which are considerably more complex. 

MUD BER Performance 

A sample of the Bit Error Rate (BER) performance of the MUD algorithm is shown in Figure 3. For 
comparison the matched-filter BER is also shown. The figure shows that MUD doubles system capacity.

Figure 3. Log10 bit error rate versus system capacity for matched
filter (blue) and multiuser detection (red)



The above performance is based on the following assumptions: 

• A single receive antenna is used 

• The target BER is 0.001 






• The percentage of system users in handoff is 30%

• Other-cell interference is 35% of intra-cell interference. This is lower than the typical value (0.60)
used. The reason is that the other-cell users in handoff with the cell of interest are included in the
intra-cell interference. This is because the cell of interest is processing these users and hence can cancel
their interference using MUD.

• A 4-tap multipath channel is used. Each tap is Rayleigh fading. The composite power of all paths is 
perfectly power controlled. 

• The channel amplitude estimation error is 10% 

• The channel delay estimation is Va chip 

• The activity factor for voice is 0.40 

• The relative amplitude of the control channel is $\beta_c$ = 0.5333



Conclusions 

The current state of processor technology is such that iterative hard-decision MUD for the UMTS uplink 
can be implemented in software on a single card or daughter card populated with four Power PC 7410 
processors, connected together with a high-bandwidth RACE++ interconnect fabric. The use of short codes 
allows MUD to be performed at the symbol rate. The advantage of symbol-rate processing is that MUD can 
be introduced into a BTS as an enhancement to the conventional RAKE receiver. The MUD processing 
takes the MF detection statistics, performs interference cancellation, and then delivers improved hard or 
soft-decision symbol estimates to the symbol-rate BTS processing functions. The latency introduced is only 
a few milliseconds. In order to perform MUD at the symbol rate the R-matrices must be updated in real 
time. There is a minimal degradation in MUD efficiency if these elements are updated at a rate of once per 
1.33 ms. The R-matrices are used to cancel the multiple access interference through the MDFIC 
interference cancellation technique. At a BER of 0.001 the use of the above MUD technique doubles 
system capacity. 
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1. Introduction 

This report briefly describes long-code Multi-User Detection (MUD). Section 2 describes 
the long-code signal model, which is different from the short-code model. Section 3 
describes the matched-filtering operation for long codes and gives a lower bound on the 
GOPS required for long-code symbol-rate MUD. The lower bound is 19.7 TOPS (i.e. Tera 
Operations Per Second; 1 TOPS = 1000 GOPS). Because of the extreme computational 
complexity of symbol-rate MUD for long codes, regenerative MUD is examined. It is shown
in Section 4 that although regenerative MUD operates at the chip rate, the overall 
complexity is lower for long codes. Two methods are examined. The first method is a 
somewhat straight-forward implementation of regenerative MUD. The required 
computational complexity is shown to be 774.6 GOPS for 100 users. The second method 
is based on combining impulse trains and subsequently raised-cosine filtering the
composite signal. The total computational complexity is shown to be 109.6 GOPS for 100 
users. Regenerative MUD is linear in the number of users, so that if the number of users 
is reduced to 64 the complexity drops to 70.1 GOPS. The complexity is also linear in the 
number of multipaths subtracted, so that if the number of multipaths subtracted is reduced 
from 4 to 2 the complexity drops to 35.1 GOPS. It may be desirable for MUD performance 
to subtract only the two largest multipaths due to channel amplitude estimation errors. The
above complexity figures are for a single interference cancellation stage. For two stages 
the computation is doubled. To perform regenerative MUD the baseband antenna stream 
data must be brought onto the MUD board. The required bandwidth is 123 MB/s. Note that 
the figures given above can perhaps be reduced through a clever implementation. A block 
diagram of regenerative MUD is shown to facilitate an investigation into the feasibility of 
an FPGA or ASIC implementation. 
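
The scaling claims above can be checked with a short calculation. The helper below is an illustrative sketch that assumes complexity is exactly linear in users, subtracted multipaths, and cancellation stages, starting from the 109.6 GOPS baseline for the second method; its name and parameters are not from the report.

```python
# Scaling of regenerative MUD complexity (second method), assuming strict linearity.
BASE_GOPS = 109.6       # 100 users, 4 multipaths, one cancellation stage
BASE_USERS = 100
BASE_MULTIPATHS = 4

def regenerative_gops(users, multipaths=BASE_MULTIPATHS, stages=1):
    return BASE_GOPS * (users / BASE_USERS) * (multipaths / BASE_MULTIPATHS) * stages

print(round(regenerative_gops(64), 1),
      round(regenerative_gops(64, multipaths=2), 1),
      round(regenerative_gops(100, stages=2), 1))
```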



2. Signal Model 
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The received signal model for short-code WCDMA is given in [1]. When long codes are 
used the signal model is different since effectively the codes change from symbol to 
symbol. We present here the WCDMA signal model for long codes. The baseband 
received signal can be written 

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_{km}[t - mT_k]\, b_k[m] + w[t] \qquad (1)$$

where t is the integer time sample index, T k = N k N c is the data bit duration, which depends 
on the user spreading factor, N k is the spreading factor for the /cth virtual user, N c is the 
number of samples per chip, K \s the total number of physical users, w[t] \s receiver noise, 
and where s^lt] is the channel-corrupted signature waveform for the /cth virtual user over 
the nth symbol period. The concept of virtual users is used to account for both the 
DPDCH and the DPCCH. Hence if there are K physical users, then there are K v = 2K 
virtual users. The user signature waveform and hence the channel-corrupted signature 
waveform vary from symbol period to symbol period since long codes by definition extend 
over many symbol periods. For L multipath components the channel-corrupted signature 
waveform for virtual user k is modeled as 

\tilde{s}_{km}[t] = \sum_{p=1}^{L} a_{kp} \, s_{km}[t - \tau_{kp}]    (2)

where a_{kp} are the complex multipath amplitudes. The amplitude ratios \beta_k are incorporated 
into the amplitudes a_{kp}. Notice that if k and l are virtual users corresponding to the 
DPCCH and the DPDCH of the same physical user then, aside from scaling by \beta_k 
and \beta_l, a_{kp} and a_{lp} are equal. This is due to the fact that the signal waveforms for both the 
DPCCH and the DPDCH pass through the same channel. 

The waveform s km [t] is referred to as the signature waveform for the kth virtual user over 
the mth symbol period. This waveform is generated by passing the spreading code 
sequence c km [n] through a pulse-shaping filter g[t] 

s_{km}[t] = \sum_{r=0}^{N_k - 1} g[t - rN_c] \, c_k[r + mN_k]    (3)

where g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine pulse as 
opposed to a root-raised-cosine pulse, the received signal r[t] represents the baseband 
signal after filtering by the matched chip filter. 



3. Matched filter 

The received signal above, which has been match-filtered to the chip pulse, must next be 
match-filtered by the user code-sequence filter. The resulting detection statistic is 
y_k[m], the matched-filter output for the kth virtual user over the mth 
symbol period. Since there are K_v codes, there are K_v such detection statistics, which are 
collected into a column vector y[m]. The matched-filter output y_l[m] for the lth virtual user 
can be written 



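y_l[m] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \frac{1}{2N_l} \sum_{n=0}^{N_l-1} r[nN_c + \hat{\tau}_{lq} + mT_l] \, c_{lm}^{*}[n] \right\}    (4)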



where \hat{a}_{lq} is the estimate of a_{lq}, \hat{\tau}_{lq} is the estimate of \tau_{lq}, and \eta_l[m] is the match-filtered 
receiver noise. Substituting r[t] from Equation (1) above gives 



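y_l[m] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \frac{1}{2N_l} \sum_{n=0}^{N_l-1} r[nN_c + \hat{\tau}_{lq} + mT_l] \, c_{lm}^{*}[n] \right\}

       = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \frac{1}{2N_l} \sum_{n=0}^{N_l-1} \sum_{k=1}^{K_v} \sum_{m'} \sum_{p=1}^{L} a_{kp} \, s_{km'}[nN_c + \tau_{lkqp}[m,m']] \, b_k[m'] \, c_{lm}^{*}[n] \right\} + \eta_l[m]

       = \sum_{k=1}^{K_v} \sum_{m'} \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{*} \, a_{kp} \, C_{lkqp}[m,m'] \right\} b_k[m'] + \eta_l[m]

C_{lkqp}[m,m'] \equiv \frac{1}{2N_l} \sum_{n=0}^{N_l-1} s_{km'}[nN_c + \tau_{lkqp}[m,m']] \, c_{lm}^{*}[n]

\tau_{lkqp}[m,m'] \equiv \hat{\tau}_{lq} + mT_l - \tau_{kp} - m'T_k    (5)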



In order to subtract interference we must, at a minimum, calculate C_{lkqp}[m,m'] for all virtual 
users and for all multipath components. A lower bound on the computational complexity 
can be determined by considering the above calculations for synchronous users. For 
synchronous users, all at the highest spreading factor, the required number of operations 
to calculate C_{lkqp}[m,m'] is 8(256)(2KL)^2 = 1.31 Gops for K = 100 and L = 4. For real-time 
operation 15000 such computations must be performed every second. This amounts to 
19.7 TOPS (i.e. Tera Operations Per Second). 
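
The lower bound above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original report; the function name and the default of 8 operations per complex mac are assumptions chosen to match the figures quoted in the text.

```python
SYMBOL_RATE_HZ = 15000  # symbol periods per second at spreading factor 256


def symbol_rate_correlation_ops(K, L, spreading_factor=256, ops_per_cmac=8):
    """Operations needed to form every C_lkqp[m,m'] for one symbol period.

    Assumes synchronous users at the highest spreading factor, with two
    virtual users per physical user, so there are 2*K*L (user, path) pairs.
    """
    pairs = 2 * K * L
    return ops_per_cmac * spreading_factor * pairs ** 2


ops = symbol_rate_correlation_ops(K=100, L=4)
print(f"{ops / 1e9:.2f} Gops per symbol period")
print(f"{ops * SYMBOL_RATE_HZ / 1e12:.1f} TOPS")
```

Because the count grows with the square of 2KL, halving the number of users cuts the symbol-rate load by a factor of four.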



4. Regenerative MUD 
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Because of the extreme computational complexity of symbol-rate MUD for long codes it is 
advantageous to resort to regenerative MUD when long codes are used. Although 
regenerative MUD operates at the chip rate, the overall complexity is lower for long 
codes. For regenerative MUD the signal waveforms of interferers are regenerated at the 
sample rate and effectively subtracted from the received signal. A second pass through 
the matched filter then yields improved performance. It turns out that the computational 
complexity of regenerative MUD is linear in the number of users. 

The received signal can be written 

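r[t] = \sum_{k=1}^{2K} \sum_{m} \sum_{p=1}^{L} a_{kp} \, s_{km}[t - \tau_{kp} - mT_k] \, b_k[m] + w[t]

     = \sum_{k=1}^{2K} r_k[t] + w[t]    (6)

r_k[t] \equiv \sum_{m} \sum_{p=1}^{L} a_{kp} \, s_{km}[t - \tau_{kp} - mT_k] \, b_k[m]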

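Subtracting interference gives a cleaned-up signal x_l[t] 

x_l[t] = r[t] - \sum_{k \ne l} \hat{r}_k[t]
       = r[t] - \hat{r}[t] + \hat{r}_l[t]
       = \hat{r}_l[t] + r_{res}[t]    (7)

r_{res}[t] \equiv r[t] - \hat{r}[t]

\hat{r}[t] \equiv \sum_{k=1}^{2K} \hat{r}_k[t]

\hat{r}_k[t] \equiv \sum_{m} \sum_{p=1}^{L} \hat{a}_{kp} \, s_{km}[t - \hat{\tau}_{kp} - mT_k] \, \hat{b}_k[m]    (8)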

Two methods are presented below for performing regenerative MUD. 



First Method 

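In order to subtract interference we must reconstruct (regenerate) the waveform s_{km}[t] as 
given in Equation (3). The waveform can be reconstructed using 

s_{km}[t] = \sum_{r=0}^{N_k-1} g[t - rN_c] \, c_{km}[r]

          = \sum_{p=0}^{N_k/4-1} \sum_{j=0}^{3} g[t - (4p + j)N_c] \, c_{km}[4p + j]

          = \sum_{p=0}^{N_k/4-1} \sum_{j=0}^{3} g[t - 4pN_c - jN_c] \, c_{kmp}[j]

          = \sum_{p=0}^{N_k/4-1} s_{kmp}[t - 4pN_c]

where c_{kmp}[j] \equiv c_{km}[4p + j] and s_{kmp}[t] \equiv \sum_{j=0}^{3} g[t - jN_c] \, c_{kmp}[j].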



The idea is that s_{km}[t] can be represented as a summation of shifted waveforms s_{kmp}[t], 
which are entirely specified by the 8 binary numbers comprising the complex sequence 
c_{kmp}[j] of length 4. Hence there are only 2^8 = 256 such waveforms. For what follows we 
assume that the signals are sampled at N_c = 8 samples per chip. Each is of length 96 + 
3(4) = 108 samples assuming that g[t] is of length 96. For 2 bytes per sample (real and 
imaginary parts) the total memory requirement is 216*256 = 55296 bytes, which spills out 
of L1 cache, but fits entirely in L2 cache. 

To generate \hat{r}_k[t] for a single symbol period, 64 of these waveforms must be read from 
memory. For each of these 64 waveforms L complex macs are required per sample per 
symbol period. Hence 64(8L)(108) operations are required per symbol period. For L = 4 
this amounts to 64(32)(108) = 221184 operations per symbol period (1/15 ms), or 3.32 
GOPS. The formation of r_{res}[t] then requires 2K times this, or 3.32(200) = 664 GOPS for K 
= 100 physical users. To form \hat{r}_l[t] + r_{res}[t] requires an additional 2(96+255*4) = 2232 
operations per symbol period per virtual user, or another 6.7 GOPS. Finally, the matched 
filter operation needs to be performed for each user, which from Equation (4) requires 
NLK complex macs (N = 256), or 256(4)(100)(8)*15000 = 12.3 GOPS. The GOPS figures 
above are for a single antenna. For two antennas the operations are doubled. Hence the 
total computational complexity is 2(664 + 6.7 + 12.3) = 1.37 TOPS. This is for a single- 
stage MPIC algorithm. For two stages the computation is doubled. 
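
The budget for the first method can be tallied term by term. The sketch below is illustrative only; the function name, the dictionary keys and the constant of 8 operations per complex mac are assumptions, while the individual factors (64 waveforms, 108 samples, 2(96 + 255*4) operations, 256-chip matched filter) are taken from the paragraph above.

```python
SYMBOLS_PER_SECOND = 15000   # symbol rate at spreading factor 256
OPS_PER_CMAC = 8             # assumed cost of one complex mac


def first_method_gops(K=100, L=4, antennas=2):
    """Return the GOPS budget of the first regeneration method."""
    virtual_users = 2 * K
    # 64 stored 4-chip waveforms of 108 samples, L complex macs per sample
    regen_per_user = 64 * (OPS_PER_CMAC * L) * 108 * SYMBOLS_PER_SECOND
    regen = regen_per_user * virtual_users
    # forming the cleaned-up signal: 2(96 + 255*4) operations per symbol per virtual user
    cleanup = 2 * (96 + 255 * 4) * virtual_users * SYMBOLS_PER_SECOND
    # matched filter: N*L*K complex macs per symbol period with N = 256
    matched = 256 * L * K * OPS_PER_CMAC * SYMBOLS_PER_SECOND
    return {
        "regen": regen / 1e9,
        "cleanup": cleanup / 1e9,
        "matched": matched / 1e9,
        "total": antennas * (regen + cleanup + matched) / 1e9,
    }


budget = first_method_gops()
print(budget)
```

The regeneration term dominates, which is why the second method below attacks that term first.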

To perform regenerative MUD the baseband antenna stream data must be brought onto 
the MUD board. The required bandwidth is 

[2 Bytes (complex)/Sa/Ant][2 Ant][8 Sa/chip][3.84 Mchips/second] = 123 MB/s 



Second Method 

The second method is to represent the waveform for each multipath for each user as a 
complex impulse train with N c = 8 samples per impulse. The complex amplitude of each 
impulse is the product of the complex chip, complex multipath amplitude and the binary 
(real) data bit estimate. These 2KL complex streams (times 2 for 2 antennas) are added to 
form a composite signal. Since this composite signal is a sum of many impulse trains, all 
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asynchronous, the composite signal is a dense (i.e. no systematic zeros) signal at the 
sample rate. A block diagram of the processing is shown in Figure 1 . 
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Figure 1. A block diagram of the long-code MUD processing 
From Equations (7) and (8) 
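r_{res}[t] = r[t] - \hat{r}[t]

\hat{r}[t] = \sum_{k=1}^{2K} \hat{r}_k[t]

  = \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{m} \hat{a}_{kp} \, s_{km}[t - \hat{\tau}_{kp} - mT_k] \, \hat{b}_k[m]

  = \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{m} \sum_{r=0}^{N_k-1} \hat{a}_{kp} \, g[t - \hat{\tau}_{kp} - (r + mN_k)N_c] \, c_k[r + mN_k] \, \hat{b}_k[\lfloor (r + mN_k)/N_k \rfloor]

  = \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \hat{a}_{kp} \, g[t - \hat{\tau}_{kp} - nN_c] \, c_k[n] \, \hat{b}_k[\lfloor n/N_k \rfloor]

  = \sum_{r} g[r] \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \hat{a}_{kp} \, \delta[t - r - \hat{\tau}_{kp} - nN_c] \, c_k[n] \, \hat{b}_k[\lfloor n/N_k \rfloor]

  = \sum_{r} g[r] \, a[t - r]    (9)

a[t] \equiv \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \hat{a}_{kp} \, c_k[n] \, \hat{b}_k[\lfloor n/N_k \rfloor] \, \delta[t - \hat{\tau}_{kp} - nN_c]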



where a[t] is the composite signal. For each symbol period this requires 256(10)(2KL) 
operations per antenna. For two antennas this amounts to 5120(200)(4) = 4096000 
operations per symbol period, or 61.4 GOPS. 
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The estimate of the received signal is then determined by passing the composite signal 
through the raised-cosine filter g[t] of length 96, which requires 96 real macs, or 192 real 
operations, per sample per real stream. There are a total of 4 real streams (2 antennas, 
real and imaginary streams). The total GOPS then for N c = 8 samples per chip is 
192(4)(8)(3.84M) = 23.6 GOPS. 

The final step is to pass the cleaned-up signal x_l[t] = \hat{r}_l[t] + r_{res}[t] through the matched filter 
(i.e. rake receiver), which gives the improved detection statistic 

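y_l^{(1)}[m] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \frac{1}{2N_l} \sum_{n=0}^{N_l-1} x_l[nN_c + \hat{\tau}_{lq} + mT_l] \, c_{lm}^{*}[n] \right\}

  = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \frac{1}{2N_l} \sum_{n=0}^{N_l-1} \hat{r}_l[nN_c + \hat{\tau}_{lq} + mT_l] \, c_{lm}^{*}[n] \right\} + y_{res,l}[m]

  = A_l^2 \, \hat{b}_l[m] + y_{res,l}[m]    (10)

A_l^2 \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \hat{a}_{lq} \right\}

y_{res,l}[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \frac{1}{2N_l} \sum_{n=0}^{N_l-1} r_{res}[nN_c + \hat{\tau}_{lq} + mT_l] \, c_{lm}^{*}[n] \right\}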



The matched filter operation requires NLK complex macs, or 256(4)(100)(8)*15000 = 12.3 
GOPS. The GOPS figures above are for a single antenna. For two antennas the 
operations are doubled, giving 24.6 GOPS. The total computational complexity for the 
second method is then 61.4 + 23.6 + 24.6 = 109.6 GOPS. 
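
The three terms of the second method can be checked in the same way. This is a sketch under stated assumptions: the function name is hypothetical, and the factor of 10 operations per chip per path is read directly from the expression 256(10)(2KL) above rather than derived independently.

```python
SYMBOLS_PER_SECOND = 15000
CHIP_RATE_HZ = 3.84e6


def second_method_gops(K=100, L=4, antennas=2, samples_per_chip=8):
    """Return (composite, pulse_filter, matched, total) in GOPS."""
    # impulse-train composite: 256 chips x 10 operations for each of the 2KL paths
    composite = 256 * 10 * (2 * K * L) * antennas * SYMBOLS_PER_SECOND
    # raised-cosine filter: 96 real macs = 192 operations per sample per real stream
    real_streams = 2 * antennas          # real and imaginary part per antenna
    pulse_filter = 192 * real_streams * samples_per_chip * CHIP_RATE_HZ
    # matched filter: N*L*K complex macs of 8 operations each, per antenna
    matched = 256 * L * K * 8 * SYMBOLS_PER_SECOND * antennas
    total = composite + pulse_filter + matched
    return composite / 1e9, pulse_filter / 1e9, matched / 1e9, total / 1e9


print(second_method_gops())
```

Compared with the first method, the pulse-shaping cost no longer scales with the number of users, because the filter is applied once to the composite stream.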
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To: Wireless Communications Group 



From: J. H. Oates 

Subject: R-matrix GOPS Date: June 21, 2000 

1. Introduction 

This report investigates a number of different methods for calculating the R- 
matrix elements. There are two parts to the calculation. First is the calculation of 
the user code correlations at lag offsets determined by the searcher receivers. 
This calculation must be performed every time a multipath component changes 
to a new lag. The assumption used here is that every 100 ms one multipath 
component changes to a new lag for each user. Hence, if each user has 4 
multipath lags, then all R-matrix elements will have changed after 400 ms. The 
validity of this assumption will have to be tested with measured data. Note that 
the WCDMA standard calls out a test with 2 multipath components, where one lag 
changes every 191 ms [1]. The second part is the actual calculation of the R- 
matrix elements, which requires a double summation of code correlations over all 
multipath components, with each term scaled by the Rayleigh-fading multipath 
amplitudes. The maximum time period to perform this calculation is about 1.33 
ms. Hence there are two parts to the calculation, each with a different update 
rate. 

Section 2 is devoted to the first part of the calculation, the code correlations. 
Section 3 covers the actual calculation of the R-matrix elements. 



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2. Calculation of User Code Correlations 

The R-matrix elements can be expressed as [2] 

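\rho_{lk}[m'] A_k = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*} \, \hat{a}_{kq'} \, C_{lkqq'}[m'] \right\}

C_{lkqq'}[m'] \equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_l^{*}[n]

              = \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_k[p] \, c_l^{*}[n]    (1)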
where C_{lkqq'}[m'] is a five-dimensional matrix of code correlations. Both l and k 
range from 1 to K_v, where K_v is the number of virtual users. If there are K physical 
users, all operating at the highest spreading factor, then there are K_v = 2K virtual 
users. For now consider K = 128 so that K_v = 256. The indices q and q' range 
from 1 to L, the number of multipath components, which for this report is 
assumed to be equal to 4. The symbol period offset m' ranges from -1 to 1. The 
total number of matrix elements to be calculated is then 
N_C = 3(K_v L)^2 = 3(1024)^2 = 3M complex elements, or 24 MB if each element is a 
float. This number is reduced, however, due to the symmetries 



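C_{klq'q}[-m'] = \frac{1}{2N_k} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_k^{*}[p] \, c_l[n]

               = \frac{N_l}{N_k} \, C_{lkqq'}^{*}[m']    (2)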

so that it is sufficient to store elements for offsets m' = 0, 1. The memory 
requirement is then 16 MB if each element is a float. If the elements are stored 
as bytes the requirement is reduced to 4 MB. 






Referring to Equation 1, line 2, it is evident that each element of C_{lkqq'}[m'] is a 
complex dot product between a code vector c_l and a waveform vector s_{kqq'}. The 
length of the code vector is 256. The length of the waveform vector is L_g + 255N_c, 
where L_g is the length of the raised-cosine pulse vector g[t] and N_c is the number 
of samples per chip. The values for these parameters as currently implemented 
are L_g = 48 and N_c = 4. The length of the waveform vector is then 1068, but for 
the dot product it is accessed at a stride of N_c = 4, which gives effectively a 
length of 267. Note that the code and waveform vectors in general do not entirely 
overlap. Also note that an increment or decrement in the symbol offset index m' 
slides the waveform vector 256 elements to the left or right respectively. Figure 1 
shows that the total number of complex macs (cmacs) for all three (m' = -1, 0, 1) 
dot products is 267, irrespective of any relative offset. 
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Figure 1. Overlap of waveform and code vectors. The total 
number of complex macs (cmacs) for all three (m'=-1,0, 1) 
dot products is 267, irrespective of any relative offset. 



Hence for any given combination of indices lkqq' the three elements C_{lkqq'}[m'] 
corresponding to m' = -1, 0 and 1 require 267 cmacs to calculate all three. Since 
there are (K_v L)^2 combinations of indices, the calculation of all elements C_{lkqq'}[m'] 
requires (K_v L)^2 (267) cmacs. Given the symmetry condition, only half of the 
elements need to be calculated, and noting that each cmac requires 8 operations 
to perform, the total number of operations required is 

N_{ops} = \frac{1}{2}(K_v L)^2 (267)(8) = \frac{1}{2}(1024)^2 (267)(8) = 1.12 Gops    (3)

The total number of GOPS (Giga Operations Per Second), then, given the 400 
ms update rate is 

N_{GOPS} = \frac{\frac{1}{2}(K_v L)^2 (267)(8) ops}{400 ms} = \frac{\frac{1}{2}(1024)^2 (267)(8) ops}{400 ms} = 2.8 GOPS    (4)
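
The chain from vector lengths to GOPS can be laid out explicitly. The following is a minimal sketch with a hypothetical function name; all numerical inputs (L_g = 48, N_c = 4, 256 chips, 8 operations per cmac, 400 ms update period) are the values stated in this section.

```python
def code_correlation_load(K_v=256, L=4, L_g=48, N_c=4, chips=256, update_s=0.400):
    """Return (waveform length, cmacs per index combination, total ops, GOPS)."""
    waveform_len = L_g + (chips - 1) * N_c        # samples in the waveform vector
    cmacs_per_index = waveform_len // N_c         # stride-N_c access covers m' = -1, 0, 1
    # (K_v L)^2 index combinations, 8 operations per cmac, halved by symmetry
    ops = (K_v * L) ** 2 * cmacs_per_index * 8 // 2
    return waveform_len, cmacs_per_index, ops, ops / update_s / 1e9


length, cmacs, ops, gops = code_correlation_load()
print(length, cmacs, f"{ops / 1e9:.2f} Gops", f"{gops:.1f} GOPS")
```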

The next section addresses the calculation of the R-matrix elements. 
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3. Calculation of R-matrix Elements 

Consider the calculation of the R-matrix elements 

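\rho_{lk}[m'] A_k = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*} \, \hat{a}_{kq'} \, C_{lkqq'}[m'] \right\}    (5)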

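The total number of matrix elements to be calculated is N_\rho = 3K_v^2. This number 
is reduced, however, due to the symmetries 

\rho_{kl}[-m'] A_l = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{kq'}^{*} \, \hat{a}_{lq} \, C_{klq'q}[-m'] \right\} = \frac{N_l}{N_k} \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*} \, \hat{a}_{kq'} \, C_{lkqq'}[m'] \right\}    (6)

so that the total number of matrix elements to be calculated is N_\rho = \frac{3}{2} K_v^2. 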

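Now let us consider the operations per element. Dropping explicit reference to 
the symbol period offset [m'], the matrix elements are 

\rho_{lk} A_k = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*} \, \hat{a}_{kq'} \, C_{lkqq'} \right\}    (7)

A brute-force calculation requires L^2(6 + 3 + 1) operations (1 complex multiply, 
one half-complex multiply, i.e. the real part, and one real add, or 6 real 
multiplies and 4 real adds). The total number of operations is then 

N_{ops} = \frac{3}{2} K_v^2 L^2 (10) = \frac{3}{2}(K_v L)^2 (10)    (8)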

For a vehicular speed of 120 km/h the Doppler frequency is 216.67 Hz for a user 
at frequency 1950 MHz. The Doppler spread is thus 433.33 Hz, and the 
corresponding coherence time is about 2.3 ms. Hence the multipath amplitudes 
are changing with a time constant of about 2 ms, and consequently the second 
part of the calculation must be updated at least every 2 ms. The channel 
amplitudes are calculated on a time slot by time slot basis. Each time slot is 
10/15 = 2/3 = 0.67 ms. Hence 2 ms equals 3 time slots, whereas two slots equals 
1.33 ms. Figures 2 and 3 below show the MUD efficiency versus user velocity for 
2 ms and 1.33 ms update times respectively. The plots show that to be able to 
effectively handle high velocity users the update time should be 1.33 ms. When 
users are at various speeds the interference from low speed users is cancelled 
more effectively than the interference from high speed users. The MUD efficiency 
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will then be an average of the MUD efficiency corresponding to each user's 
speed. 
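
The Doppler and slot arithmetic behind the choice of update time can be written out as follows. This is an illustrative sketch only: the function names are hypothetical, the speed of light is rounded to 3.0e8 m/s because that value reproduces the 216.67 Hz quoted above, and the coherence time is taken as the reciprocal of the two-sided Doppler spread.

```python
C_LIGHT = 3.0e8          # m/s, rounded (assumption, matches the quoted 216.67 Hz)
SLOT_S = 10e-3 / 15      # one WCDMA time slot: a 10 ms frame holds 15 slots


def doppler_hz(speed_kmh, carrier_hz):
    """Maximum Doppler shift for a user moving at speed_kmh."""
    return (speed_kmh / 3.6) * carrier_hz / C_LIGHT


def coherence_time_s(speed_kmh, carrier_hz):
    """Reciprocal of the two-sided Doppler spread 2 * f_d."""
    return 1.0 / (2.0 * doppler_hz(speed_kmh, carrier_hz))


f_d = doppler_hz(120, 1950e6)
t_c = coherence_time_s(120, 1950e6)
print(f"f_d = {f_d:.2f} Hz, coherence time = {t_c * 1e3:.1f} ms")
print(f"2 slots = {2 * SLOT_S * 1e3:.2f} ms, 3 slots = {3 * SLOT_S * 1e3:.2f} ms")
```

An update every two slots stays well inside the coherence time at 120 km/h, whereas three slots is only marginally shorter than it.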



Figure 2. MUD efficiency versus user velocity (kmph) for a 
2 ms (3 time slots) R-matrix update time. 

Figure 3. MUD efficiency versus user velocity (kmph) for a 
1.33 ms (2 time slots) R-matrix update time. 



The calculations below are based on a 1.33 ms update time. Note that most of 
the capacity and coverage benefits calculated for MUD so far have assumed 
70% MUD efficiency. The 1.33 ms update time is sufficient to achieve 70% MUD 
efficiency. The total GOPS are then 

N_{GOPS} = \frac{\frac{3}{2}(K_v L)^2 (10)}{1.33 ms} = \frac{1.5(256 \cdot 4)^2 (10)}{1.33 ms} = 11.8 GOPS    (9)



where we have assumed L = 4 multipath components. A better way to perform 
this operation is 



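\rho_{lk} A_k = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \left[ \sum_{q'=1}^{L} C_{lkqq'} \, \hat{a}_{kq'} \right] \right\}    (10)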



The inner sum is a matrix-vector multiply, hence requiring L^2 cmacs, and the 
outer sum is the real part of a complex dot product, which requires L half-cmacs. 
The total is then (L^2 + L/2) = 1.125 L^2 cmacs (for L = 4) times 8 operations per 
cmac, or 9L^2 operations, which gives 



N_{GOPS} = \frac{\frac{3}{2}(K_v L)^2 (9)}{1.33 ms} = \frac{1.5(256 \cdot 4)^2 (9)}{1.33 ms} = 10.6 GOPS    (11)



The above calculations are represented in terms of complex numbers, which are 
not directly calculable. To express the above equations explicitly in terms of real 
numbers it is convenient to cast the calculations into matrix form 



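\rho_{lk} A_k = \mathrm{Re}\left\{ \begin{bmatrix} \hat{a}_{l1}^{*} & \hat{a}_{l2}^{*} & \cdots & \hat{a}_{lL}^{*} \end{bmatrix} \begin{bmatrix} C_{lk11} & C_{lk12} & \cdots & C_{lk1L} \\ C_{lk21} & C_{lk22} & \cdots & C_{lk2L} \\ \vdots & \vdots & \ddots & \vdots \\ C_{lkL1} & C_{lkL2} & \cdots & C_{lkLL} \end{bmatrix} \begin{bmatrix} \hat{a}_{k1} \\ \hat{a}_{k2} \\ \vdots \\ \hat{a}_{kL} \end{bmatrix} \right\}

              = \mathrm{Re}\left\{ \hat{a}_l^{H} \, C_{lk} \, \hat{a}_k \right\}    (12)

C_{lk} \equiv \begin{bmatrix} C_{lk11} & C_{lk12} & \cdots & C_{lk1L} \\ C_{lk21} & C_{lk22} & \cdots & C_{lk2L} \\ \vdots & \vdots & \ddots & \vdots \\ C_{lkL1} & C_{lkL2} & \cdots & C_{lkLL} \end{bmatrix}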

The quadratic form \hat{a}_l^{H} C_{lk} \hat{a}_k can be expressed 






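\mathrm{Re}\{ \hat{a}_l^{H} C_{lk} \hat{a}_k \} \equiv \mathrm{Re}\{ [a_r^T - j a_i^T][C_r + jC_i][b_r + j b_i] \}

  = \mathrm{Re}\{ [a_r^T - j a_i^T][C_r b_r - C_i b_i + j(C_r b_i + C_i b_r)] \}

  = \mathrm{Re}\{ a_r^T C_r b_r - a_r^T C_i b_i + a_i^T C_r b_i + a_i^T C_i b_r
        + j(a_r^T C_r b_i + a_r^T C_i b_r - a_i^T C_r b_r + a_i^T C_i b_i) \}

  = \begin{bmatrix} a_r^T & a_i^T \end{bmatrix} \begin{bmatrix} C_r & -C_i \\ C_i & C_r \end{bmatrix} \begin{bmatrix} b_r \\ b_i \end{bmatrix}

  \equiv a^T C \, b    (13)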



The matrix-vector multiplication requires (2L)^2 macs. The dot product adds (2L) 
macs so that the total is (2L)^2 + (2L) macs. For L = 4 we have 1.125(2L)^2 macs = 
4.5L^2 macs = 9L^2 operations. The total GOPS are then 



N_{GOPS} = \frac{\frac{3}{2}(K_v L)^2 (9)}{1.33 ms} = \frac{1.5(256 \cdot 4)^2 (9)}{1.33 ms} = 10.6 GOPS    (14)
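
The equivalence between the complex quadratic form and its stacked real form can be checked numerically. The sketch below is not from the original report; the function names are hypothetical and random test data stands in for real amplitude estimates and code correlations.

```python
import random


def quad_form_complex(a, C, b):
    """Re{a^H C b} evaluated with complex arithmetic."""
    L = len(a)
    total = 0j
    for q in range(L):
        inner = sum(C[q][p] * b[p] for p in range(L))   # row q of the matrix-vector product
        total += a[q].conjugate() * inner
    return total.real


def quad_form_real(a, C, b):
    """Same quantity using stacked real vectors and the 2L x 2L real matrix."""
    L = len(a)
    a_s = [x.real for x in a] + [x.imag for x in a]
    b_s = [x.real for x in b] + [x.imag for x in b]
    # block matrix [[C_r, -C_i], [C_i, C_r]]
    M = [[C[i][j].real for j in range(L)] + [-C[i][j].imag for j in range(L)]
         for i in range(L)]
    M += [[C[i][j].imag for j in range(L)] + [C[i][j].real for j in range(L)]
          for i in range(L)]
    Mb = [sum(M[i][j] * b_s[j] for j in range(2 * L)) for i in range(2 * L)]
    return sum(a_s[i] * Mb[i] for i in range(2 * L))


random.seed(0)
L = 4
rnd = lambda: complex(random.uniform(-1, 1), random.uniform(-1, 1))
a = [rnd() for _ in range(L)]
b = [rnd() for _ in range(L)]
C = [[rnd() for _ in range(L)] for _ in range(L)]
print(abs(quad_form_complex(a, C, b) - quad_form_real(a, C, b)) < 1e-12)
```

The real form uses only real multiply-accumulates, which is why its cost is counted in macs rather than cmacs.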



Now consider a different formulation which attempts to reuse the 
amplitude-amplitude multiplications. Consider the calculation a^T C b 

a^T C \, b = \mathrm{tr}[a^T C \, b] = \mathrm{tr}[C \cdot (b a^T)] = \mathrm{tr}[C \cdot X]

X \equiv b a^T    (15)



The calculations to produce matrix X are pure multiplications, but the elements, 
once calculated, can be reused for the other virtual users corresponding to the 
same physical users. For voice-only users there are 2 virtual users per physical 
user. For data users there can be up to 65 virtual users per physical user. For 
now, however, we stay with our 128 voice-user scenario. To calculate X, then, 
requires (2L)^2 = 4L^2 multiplications. This calculation is performed once per pair of 
physical users, so the total number of operations is 

N_X = (KL)^2 (4) = (K_v L)^2 (1) = \frac{3}{2}(K_v L)^2 \left(\frac{2}{3}\right)    (16)

Effectively, then, X requires (2/3)L^2 operations. The details to calculate a^T C b 
are 

a^T C \, b = \mathrm{tr}[C X] = \mathrm{tr}\left[ \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_{2L} \end{bmatrix} \begin{bmatrix} x_1 & x_2 & \cdots & x_{2L} \end{bmatrix} \right] = \sum_{i=1}^{2L} c_i \, x_i    (17)



where c_i is the ith row of C and x_i is the ith column of X. Hence we have 2L dot 
products of length 2L, which require (2L)^2 macs = 8L^2 operations. To calculate 
a^T C b then requires 8L^2 + (2/3)L^2 = 8.67L^2 operations, which gives 

N_{GOPS} = \frac{\frac{3}{2}(K_v L)^2 (8.67)}{1.33 ms} = \frac{1.5(256 \cdot 4)^2 (8.67)}{1.33 ms} \approx 10.2 GOPS

A better way to perform this calculation is as follows 



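\rho = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_q^{*} \, a_{q'} \, C_{qq'} \right\}    (18)

     = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ (X_{qq'}^{r} + jX_{qq'}^{i})(C_{qq'}^{r} + jC_{qq'}^{i}) \right\}

     = \sum_{q=1}^{L} \sum_{q'=1}^{L} \left( X_{qq'}^{r} C_{qq'}^{r} - X_{qq'}^{i} C_{qq'}^{i} \right)    (19)

X_{qq'}^{r} \equiv a_q^{r} a_{q'}^{r} + a_q^{i} a_{q'}^{i}

X_{qq'}^{i} \equiv a_q^{r} a_{q'}^{i} - a_q^{i} a_{q'}^{r}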

where for convenience we have dropped A_k, the lk subscripts and the hat 
symbols. The calculation of X requires 

N_X = (KL)^2 (6) = (K_v L)^2 (6/4) = \frac{3}{2}(K_v L)^2 (1)    (20)

operations. Note that, once the X values are calculated, the remainder of the 
calculation is a long dot product of length 2L^2, hence requiring 2L^2 macs, or 4L^2 
operations, which gives 






N_{GOPS} = \frac{\frac{3}{2}(K_v L)^2 (5)}{1.33 ms} = \frac{1.5(256 \cdot 4)^2 (5)}{1.33 ms} = 5.9 GOPS    (21)
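
The amplitude-product formulation can be illustrated with a small numerical check. This is a sketch under stated assumptions: the function names are hypothetical, `a` holds the amplitudes of user l and `b` those of user k, and the sign convention X = conj(a_q) * b_q' is the one under which the long dot product X^r C^r - X^i C^i reproduces the direct result.

```python
import random


def amplitude_products(a, b):
    """Real and imaginary parts of conj(a[q]) * b[p] for all path pairs."""
    L = len(a)
    Xr = [[a[q].real * b[p].real + a[q].imag * b[p].imag for p in range(L)]
          for q in range(L)]
    Xi = [[a[q].real * b[p].imag - a[q].imag * b[p].real for p in range(L)]
          for q in range(L)]
    return Xr, Xi


def rho_from_products(Xr, Xi, C):
    """Long real dot product of length 2 L^2 between the X values and C."""
    L = len(C)
    return sum(Xr[q][p] * C[q][p].real - Xi[q][p] * C[q][p].imag
               for q in range(L) for p in range(L))


def rho_direct(a, C, b):
    """Reference value: sum over q, q' of Re{conj(a_q) b_q' C_qq'}."""
    L = len(C)
    return sum((a[q].conjugate() * b[p] * C[q][p]).real
               for q in range(L) for p in range(L))


random.seed(1)
L = 4
rnd = lambda: complex(random.uniform(-1, 1), random.uniform(-1, 1))
a, b = [rnd() for _ in range(L)], [rnd() for _ in range(L)]
Xr, Xi = amplitude_products(a, b)          # computed once per pair of physical users
for _ in range(2):                         # reused for each virtual-user C matrix
    C = [[rnd() for _ in range(L)] for _ in range(L)]
    print(abs(rho_from_products(Xr, Xi, C) - rho_direct(a, C, b)) < 1e-12)
```

The point of the design is that `amplitude_products` depends only on the physical users, so its cost is amortized over every virtual-user pair that shares those amplitudes.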



Dual Diversity Antennas 

When dual diversity antennas are employed, the calculation of the R-matrix 
elements becomes 

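\rho = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{1q}^{*} a_{1q'} C_{qq'} \right\} + \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{2q}^{*} a_{2q'} C_{qq'} \right\}

     = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ (X_{1qq'} + X_{2qq'}) \, C_{qq'} \right\}    (22)

     = \sum_{q=1}^{L} \sum_{q'=1}^{L} \left( X_{qq'}^{r} C_{qq'}^{r} - X_{qq'}^{i} C_{qq'}^{i} \right)

X_{qq'} \equiv X_{1qq'} + X_{2qq'}

To calculate X for dual diversity antennas, then, requires 

N_X = (KL)^2 (14) = (K_v L)^2 (14/4) = \frac{3}{2}(K_v L)^2 (2.33)    (23)

operations. The remainder of the calculation is again a long dot product of length 
2L^2 requiring 4L^2 operations, which gives 

N_{GOPS} = \frac{\frac{3}{2}(K_v L)^2 (6.33)}{1.33 ms} = \frac{1.5(256 \cdot 4)^2 (6.33)}{1.33 ms} = 7.5 GOPS    (24)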


Reuse of C data 

So far we have not addressed the problem associated with a lack of data reuse, 
which renders our calculations I/O limited. The C data can be reused by 
introducing extra latency into the calculations. For a given user, a single 
multipath component changes on average once every 100 ms, or once every 150 
slots. Suppose we collect and save in cache 4 amplitude estimate vectors a_k[q], 
where q is the 2 ms update index. The total latency is then 8 ms = 12 time slots. 
During this time the probability that a multipath lag changes is (8 ms)/(100 ms) = 
0.08. The probability that the matrix C_{lk} changes is then 1 - (1 - 0.08)^2 = 0.15. 
Hence for most matrices C_{lk} we will be able to calculate 

\hat{a}_l^{H}[q] \cdot C_{lk} \cdot \hat{a}_k[q]    (25)

for all update indices q in the 12 time slots with only one read of C_{lk} from memory. 
The penalty for this reuse is the 8 ms of latency incurred. 
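
The reuse probability quoted above follows from treating the lag changes of the two users as independent events. A minimal sketch, with a hypothetical function name and that independence stated as an assumption:

```python
SLOT_MS = 10.0 / 15.0     # one time slot in milliseconds


def c_matrix_change_probability(latency_ms=8.0, lag_change_period_ms=100.0):
    """Probability that C_lk changes during the buffered interval.

    Assumes each of the two users involved sees a lag change with
    probability latency / period, independently of the other.
    """
    p_user = latency_ms / lag_change_period_ms
    return 1.0 - (1.0 - p_user) ** 2


print(round(8.0 / SLOT_MS), "time slots of latency")
print(f"{c_matrix_change_probability():.4f}")
```

A longer buffer increases reuse per memory read but also raises the chance that the cached C matrix is stale, so the 8 ms figure is a trade-off rather than a hard limit.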
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To: John Oates, John Greene, Alden Fuchs, Frank Date: 31-AUG-2000 

Lauginiger 
From: Mike Vinskus 

Subject: Theoretically optimum load balancing for the R matrix calculations 
File Ref: mjv-9.doc 

This memo describes the calculation of optimum R matrix partitioning points in 
normalized virtual user space. These partitioning points provide an equal, and hence 
balanced, computation load per processor. The computational model of the R matrix 
calculations does not include any data access overhead or caching effects. It is shown 
that a closed form recursive solution exists that can be solved for an arbitrary number of 
processors. 

Although three R matrices are output from the R matrix calculation function, only half of 
the elements are explicitly calculated. This is due to the symmetry condition that exists 
between R matrices: 

R_{lk}(m) = \xi R_{kl}(-m). 

In essence only two matrices need to be calculated. The first one is a combination of 
R(1) and R(-1). The second is the R(0) matrix. In this case, the essential R(0) matrix 
elements have a triangular structure to them. The number of computations performed to 
generate the raw data for the R(1)/R(-1) and R(0) matrices are combined and optimized as 
a single number. This is due to the reuse of the X matrix outer product values across the 
two R matrices. Since the bulk of the computations involve combining the X matrix and 
correlation values, they dominate the processor utilization. These computations are used 
as a cost metric in determining the optimum loading of each processor. 

The optimization problem is formulated as an equal area problem, where the solution 
results in each partition area being equal. Since the major dimensions of the R matrices 
are in terms of the number of active virtual users, the solution space for this problem is in 
terms of the number of virtual users per processor. By normalizing the solution space by 
the number of virtual users, the solution is applicable for an arbitrary number of virtual 
users. 
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Figure 1: Normalized R matrix computation model. 

Figure 1 shows the model of the normalized optimization problem. The computations for 
the R(1)/R(-1) matrix are represented by the square HJKM, while the computations for 
the R(0) matrix are represented by the triangle ABC. From geometry, the area of a 
rectangle of length b and height h is 

A_r = bh. 

For a triangle with a base width b and height h, the area is calculated by 

A_t = \frac{1}{2} bh. 

When combined with a common height a_i the formula for the area becomes 

A_i = A_{r,i} + A_{t,i} 
    = a_i + \frac{1}{2} a_i^2. 

The formula for A_i gives the area for the total region below the partition line. For 
example, the formula for A_2 gives the area within the rectangle HQRM plus the region 
within triangle AFG. For the cost function, the difference in successive areas is used. 
That is 

B_i = A_i - A_{i-1} 
    = \frac{1}{2} a_i^2 + a_i - \frac{1}{2} a_{i-1}^2 - a_{i-1}. 

For an optimum solution, the B_i must be equal for i = 1, 2, ..., N, where N is the number 
of processors performing the calculations. Because the total normalized load is equal to 
A_N, the load per processor is equal to A_N/N. 

B_i = \frac{A_N}{N} = \frac{3}{2N}, for i = 1, 2, ..., N. 

By combining the two equations for B_i, the solution for a_i is found by finding the roots of 
the equation: 

\frac{1}{2} a_i^2 + a_i - \frac{1}{2} a_{i-1}^2 - a_{i-1} - \frac{3}{2N} = 0. 

The solution for a_i is: 

a_i = -1 \pm \sqrt{1 + a_{i-1}^2 + 2a_{i-1} + \frac{3}{N}}, for i = 1, 2, ..., N. 

Since the solution space must fall in the range [0, 1], negative roots are not valid 
solutions to the problem. On the surface, it appears that the a_i must be solved by first 
solving for the case where i = 1. However, by expanding the recursions of the a_i and using 
the fact that a_0 equals zero, a solution that does not require previous a_i, i = 0, 1, ..., n-1, 
exists. The solution is: 

a_i = -1 + \sqrt{1 + \frac{3i}{N}}, for i = 1, 2, ..., N. 

Table 1 shows the normalized partition values for two, three, and four processors. To 
calculate the actual partitioning values, the number of active virtual users is multiplied by 
the corresponding table entries. Since a fraction of a user cannot be allocated, a ceiling 
operation is performed that biases the number of virtual users per processor towards the 
processors whose loading function is less sensitive to perturbations in the number of 
users. 
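
The partition points can be generated directly from the equal-area condition B_i = 3/(2N). The sketch below is illustrative: the function names are hypothetical, and the closed form a_i = -1 + sqrt(1 + 3i/N) is the one obtained by unrolling the recursion with a_0 = 0; it is cross-checked against the recursion itself.

```python
import math


def partition_points(n_proc):
    """Normalized partition locations a_1..a_N from the closed form."""
    return [-1.0 + math.sqrt(1.0 + 3.0 * i / n_proc) for i in range(1, n_proc + 1)]


def partition_points_recursive(n_proc):
    """Same values from the positive root of the recursion, starting at a_0 = 0."""
    points, prev = [], 0.0
    for _ in range(n_proc):
        prev = -1.0 + math.sqrt(1.0 + prev * prev + 2.0 * prev + 3.0 / n_proc)
        points.append(prev)
    return points


def user_boundaries(n_users, n_proc):
    """Scale to actual virtual users; the ceiling avoids fractional users."""
    return [math.ceil(a * n_users) for a in partition_points(n_proc)]


for n in (2, 3, 4):
    print(n, [round(a, 4) for a in partition_points(n)])
print(user_boundaries(256, 4))
```

The last partition point is always 1, so the final processor's upper boundary is the total number of active virtual users.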

Table 1 : Normalized partition locations for two, three, and four processors. 
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-1 + V3 (0.7321) 
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™ (0.8028) 







Mercury Computer Systems, Inc. 

199 Riverneck Road 

Chelmsford, MA 01824-2820 

(978) 256-1300 • Fax (978) 256-3599 

http://www.mc.com 



To: Jonathan Schonfeld 
From: Nmf 
Date: 23-FEB-2001 
Subject: Degraded mode of operation for the MUD algorithm 
File Ref: mjv-018-degraded_mode_desc.doc 



Reference [1] showed that the load balancing for the R matrix calculations resulted in a non-uniform 
partitioning of the rows of the final R matrices over a number of processors. In summary, the 
partition sizes increase as the partition starting user index increases. 

When the system is running at full capacity (i.e. the maximum number of users is processed while 
still within the bounds of real-time operation) and a computational node has a failure, the impact can 
be significant. 

This impact can be minimized by allocating the first user partition to the disabled node. Also the 
values that would have been calculated by that node are set to zero. This reduces the effects of the 
failed node. Also, by changing which user data is set to zero (i.e. which users are assigned to the 
failed node) the overall errors due to the lack of non-zero output data for that node are averaged 
over all of the users, providing a "soft" degradation. 

References: 

[1] M. Vinskus. "mjv-009: Theoretically optimum load balancing for the R matrix calculations." 
31-AUG-2000. 

[2] M. Vinskus. "mjv-010: Preliminary degraded MUD operation results." 19-OCT-2000. 
[3] J. Oates. "jho-001: MUD Algorithms." 25-APR-2000. 
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To: Wireless Communications Group 
From: J. H. Oates 

Subject: Methods for Calculating the C-matrix Elements Date: November 13, 2000 
1. Direct Method 

The direct method for calculating the C-matrix elements is 

r_{lk}[m'] = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*} \, \hat{a}_{kq'} \, C_{lkqq'}[m'] \right\}    (1)

C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_l^{*}[n]

Symmetry 

Due to symmetry there are 1.5(K_v L)^2 elements to calculate. Assuming all users 
are at SF 256, each calculation requires 256 cmacs, or 2048 operations. The 
probability that a multipath changes in a 10 ms time period is approximately 
10/200 = 0.05 if all users are at 120 kmph. Assuming a mix of user velocities, 
let's say the probability is 0.025. Since the C-matrix elements represent the 
interaction between two users, the probability that C-matrix elements change in a 
10 ms time period is approximately 0.10 if all users are at 120 kmph, or 0.05 for 
a mix of user velocities. The GOPS are tabulated in Table 1 below. 
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The C-matrix elements also need to be updated when the spreading factor 
changes. The spreading factor can change due to 

• AMR codec rate changes 

• Multiplexing of DCCH 

• Multiplexing data services 

For lack of a better number, assume that 5% of the users, hence 10% of the 
elements change rate every 10 ms. 



K_v    High velocity users    1.5(K_v L)^2    Gops     Percentage change    GOPS 
200    100%                   960,000         1.966    20                   39.3 
200    50%                    960,000         1.966    15                   29.5 
128    100%                   393,216         0.805    20                   16.1 
128    50%                    393,216         0.805    15                   12.1 
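
Each row of the table follows from three quantities: the number of elements, the cost per element, and the fraction updated per 10 ms. The following sketch reproduces the rows; the function name is hypothetical, and the percentage change is taken as an input exactly as listed in the table.

```python
def direct_method_row(K_v, percent_change, L=4, ops_per_element=2048, period_s=0.010):
    """Return (elements, Gops for a full recalculation, sustained GOPS)."""
    elements = 3 * (K_v * L) ** 2 // 2                 # 1.5 (K_v L)^2 after symmetry
    gops_full = elements * ops_per_element / 1e9       # cost of recomputing everything once
    sustained = gops_full * (percent_change / 100.0) / period_s
    return elements, gops_full, sustained


for K_v, pct in ((200, 20), (200, 15), (128, 20), (128, 15)):
    elements, full, sustained = direct_method_row(K_v, pct)
    print(K_v, pct, elements, round(full, 3), round(sustained, 1))
```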
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2. FFT Method 

The FFT can be used to calculate the correlations for a range of offsets \tau using 

C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_l^{*}[n]

              = C_{lk}[\tau_{lkqq'm'}]    (3)

C_{lk}[\tau] \equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + \tau] \, c_l^{*}[n]

where \tau_{lkqq'm'} \equiv m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}. 

The length of the waveform s_k[t] is L_g + 255N_c = 1068 for L_g = 48 and N_c = 4. This 
is represented as N_c waveforms of length L_g/N_c + 255 = 267. 

One advantage of this approach is that elements can be stored for a range of 
offsets \tau so that calculations do not need to be performed when lags change. For 
delay spreads of about 4 μs, 32 samples need to be stored for each m'. 
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3. Using Code Correlations 

The C-matrix elements can be represented in terms of the underlying code 
correlations using 

C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_l^{*}[n]

  = \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_k[p] \, c_l^{*}[n]

  = \frac{1}{2N_l} \sum_{m} g[mN_c + \tau] \sum_{n} c_k[n - m] \, c_l^{*}[n]

  = \sum_{m} g[mN_c + \tau] \, \Gamma_{lk}[m]

\Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n} c_k[n - m] \, c_l^{*}[n],    \tau \equiv m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}

If the length of g[t] is L_g = 48 and N_c = 4, then the summation over m requires 
48/4 = 12 macs for the real part and 12 macs for the imaginary part. The total ops 
is then 48 ops per element. (Compare with 2048 operations for the direct 
method.) Hence for the case where there are 200 virtual users and 20% of the 
C-matrix needs updating every 10 ms the required complexity is (960000 el)(48 
ops/el)(0.20)/(0.010 sec) = 921.6 MOPS. This is the required complexity to 
compute the C-matrix from the \Gamma-matrix. The cost of computing the \Gamma-matrix must 
also be considered. There is reason to hope that the \Gamma-matrix can be efficiently 
computed since the fundamental operation is a convolution of codes with 
elements constrained to be +/-1. 
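
The regrouping of the double sum into a short sum over code correlations can be verified numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the pulse g is stored as a causal array indexed from 0 (a centred raised-cosine pulse would only shift the offset), codes are taken as zero outside their length, and small random test vectors replace real WCDMA codes.

```python
import random


def direct_c(g, ck, cl, N_c, tau):
    """(1/2N) sum_n s_k[n N_c + tau] conj(c_l[n]), with s_k[t] = sum_p g[t - p N_c] c_k[p]."""
    N = len(cl)
    gval = lambda t: g[t] if 0 <= t < len(g) else 0.0
    total = 0j
    for n in range(N):
        s = sum(gval(n * N_c + tau - p * N_c) * ck[p] for p in range(len(ck)))
        total += s * cl[n].conjugate()
    return total / (2 * N)


def code_corr(ck, cl, m):
    """Gamma[m] = (1/2N) sum_n c_k[n - m] conj(c_l[n])."""
    N = len(cl)
    return sum(ck[n - m] * cl[n].conjugate()
               for n in range(N) if 0 <= n - m < len(ck)) / (2 * N)


def c_from_gamma(g, ck, cl, N_c, tau):
    """sum_m g[m N_c + tau] Gamma[m]; only about len(g)/N_c lags contribute."""
    total, used = 0j, 0
    for m in range(-(len(ck) - 1), len(cl)):
        t = m * N_c + tau
        if 0 <= t < len(g):
            total += g[t] * code_corr(ck, cl, m)
            used += 1
    return total, used


random.seed(2)
chip = lambda: complex(random.choice((-1, 1)), random.choice((-1, 1)))
ck, cl = [chip() for _ in range(16)], [chip() for _ in range(16)]
g = [random.uniform(-1, 1) for _ in range(48)]
value, lags = c_from_gamma(g, ck, cl, 4, tau=5)
print(abs(value - direct_c(g, ck, cl, 4, tau=5)) < 1e-12, lags)
```

In this sketch the Gamma values are recomputed on the fly; in the scheme described above they would be looked up from the precomputed matrix, which is where the saving comes from.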

The \Gamma-matrix elements can be calculated using 

• the FFT 

• Modulo-2 arithmetic 

• Hardware XOR 

• Short-code generator(?) 
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4. Using Fundamental Correlations 

The waveform s_k[t] can be decomposed into fundamental waveforms 
corresponding to 4-chip segments of the corresponding complex user codes. 
There are 2^8 = 256 such waveforms. Each of these can be correlated with 
another 256 possible 4-chip code segments. For each correlation there are about 
64 offsets that produce a non-zero correlation. Hence all correlation calculations 
can be represented in terms of 256(256)(64) = 4M fundamental complex 
correlations. The C-matrix elements are then 

C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_{n} s_k[nN_c + \tau] \, c_l^{*}[n],    \tau \equiv m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}

  = \sum_{i=0}^{63} \sum_{j=0}^{63} \frac{1}{2N_l} \sum_{n=0}^{3} s_{kj}[nN_c + \tau + 4(i - j)N_c] \, c_{li}^{*}[n]

  = \sum_{i=0}^{63} \sum_{j=0}^{63} \Gamma_{c_{li} c_{kj}}[\tau + 4(i - j)N_c]

where c_{li}[n] \equiv c_l[4i + n] and s_{kj}[t] \equiv \sum_{r=0}^{3} g[t - rN_c] \, c_k[4j + r]. 

Using the above, each C-matrix element requires 64(64) = 4096 complex adds, 
or 8192 operations to calculate. (Compare with 2048 operations for the direct 
method.) 

Alternately, the calculations can be represented in terms of 4-chip real code 
segments and the corresponding waveforms. Hence all correlation calculations 
can be represented in terms of 16(16)(64) = 16K fundamental real correlations. 
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To: 



Wireless Communications Group 



From: 



J. H. Oates 



Subject: Calculation of C-matrix Elements 



Date: August 10, 2000 



1. Introduction 

The C-matrix elements are used to calculate the R-matrices, which are used by the MDF 
interference cancellation routine. Each C-matrix element can be calculated as a dot 
product between the kth user's waveform and the lth user's code stream, each offset by 
some multipath delay. For this method of calculation, each time a user's multipath profile 
changes all C-matrix elements associated with the changed profile must be recalculated. 
It is estimated that a user profile changes every 100 ms. This number, however, is based 
on very little data, and there is considerable risk that profiles may change more rapidly 
and compromise real-time operation. In addition, there is a large amount of overhead that 
must be performed before each dot product. In a recent benchmark the overhead 
consumed nearly all of the time allocated for the entire C-matrix update. Finally, if the C- 
matrix is calculated as described above then an entire processor must be allocated for 
this calculation. 

In view of the above observations a better approach is to pre-calculate the code 
correlations up front when a user is added to the system. This calculation is performed 
over all possible code offsets and the results are stored in a large array, 
approximately 21 Mbytes in size. We will henceforth refer to this large matrix as the Γ 
matrix. The C-matrix elements are updated when a profile changes by extracting the 
appropriate elements from the Γ matrix and performing minor calculations. Since the Γ 
matrix elements are calculated for all code offsets the FFT can be effectively used to 
speed up the calculations. Since all code offsets are pre-calculated, there is no risk 
associated with rapidly changing multipath profiles. Under normal operating conditions, 
when the number of users accessing the system is constant, the resources which must be 
allocated to extracting the C-matrix elements are minimal, and so extra resources may be 
allocated to the R-matrix calculation. 

Section 2 below outlines the calculation of the Γ matrix elements. It is shown that the Γ 
matrix elements are given in terms of a convolution. Section 3 shows how to calculate the 
Γ matrix elements using the FFT. Section 4 describes how the Γ-matrix elements might be 
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accessed from SDRAM. In section 5 various processing times are estimated, and a 
summary with conclusions is given in section 6. 

2. C-matrix Elements Expressed in Terms of Code Correlations 

The C-matrix elements, from which the R-matrix elements are calculated, are given by [1] 

$$C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_n s_k[nN_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_l^*[n] \qquad (1)$$

where C_lkqq'[m'] is a five-dimensional matrix of code correlations. Both l and k range from 1 
to K_v, where K_v is the number of virtual users. The indices q and q' range from 1 to L, the 
number of multipath components, which is assumed to be equal to 4. The symbol period 
offset m' ranges from -1 to 1. The total number of matrix elements to be calculated is then 
N_C = 3(K_v L)^2 = 3(800)^2 = 1.92M complex elements, or 3.84 MB if each real and each 
imaginary part is stored in one byte. This number is cut in half, however, due to the symmetries [2] 

$$C_{lkqq'}[-m'] = \frac{N_k}{N_l} C^*_{klq'q}[m'] \qquad (2)$$

The memory requirement is then 1.92 MB. 

Referring to Equation (1) it is evident that each element of C_lkqq'[m'] is a complex dot 
product between a code vector c_l and a waveform vector s_k. The length of the code 
vector is 256. The waveform s_k[t] is referred to as the signature waveform for the kth 
virtual user. This waveform is generated by passing the spread code sequence c_k[n] 
through a pulse-shaping filter g[t] 

$$s_k[t] = \sum_{p=0}^{N-1} g[t - pN_c] \, c_k[p] \qquad (3)$$

where N = 256 and g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine 
pulse as opposed to a root-raised-cosine pulse, the signature waveform s_k[t] includes the 
effects of filtering by the matched chip filter. Note that for spreading factors less than 256 
some of the chips c_k[p] are zero. The length of the waveform vector is L_g + 255N_c, where 
L_g is the length of the raised-cosine pulse vector g[t] and N_c is the number of samples per 
chip. The values for these parameters as currently implemented are L_g = 48 and N_c = 4. 
The length of the waveform vector is then 1068, but for the dot product it is accessed at a 
stride of N_c = 4, which gives effectively a length of 267. 

The raised-cosine pulse vector g[t] is defined to be non-zero from t = -L_g/2 + 1 : L_g/2, with 
g[0] = 1. With this definition the waveform s_k[t] is non-zero from t = -L_g/2 + 1 : L_g/2 + 255N_c. 
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By combining Equations (1) and (3) the calculation of the C-matrix elements can be 

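expressed directly in terms of the user code correlations. These correlations can be 
calculated up front and stored in SDRAM. The C-matrix elements expressed in terms of 
the code correlations Γ_lk[m] are 

$$\begin{aligned}
C_{lkqq'}[m'] &= \frac{1}{2N_l} \sum_n \sum_p g[(n-p)N_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_k[p] \cdot c_l^*[n] \\
&= \frac{1}{2N_l} \sum_n \sum_m g[mN_c + \tau] \cdot c_k[n-m] \cdot c_l^*[n] \\
&= \sum_m g[mN_c + \tau] \, \frac{1}{2N_l} \sum_n c_l^*[n] \, c_k[n-m] \\
&= \sum_m g[mN_c + \tau] \, \Gamma_{lk}[m]
\end{aligned} \qquad (4)$$

where τ = m'T + τ_lq - τ_kq' and m = n - p. 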
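The equivalence stated in Equation (4) can be checked numerically. The following Python sketch is illustrative only: it assumes short random codes with chips ±1 ± j, a triangular stand-in for the pulse g (not the raised-cosine pulse of the memo), and hypothetical function names. It compares the direct dot product of Equation (1) with the sum over code correlations.

```python
import random

def signature(code_k, g, g_origin, n_c):
    """s_k[t] = sum_p g[t - p*N_c] * c_k[p], returned as {t: value}."""
    s = {}
    for p, chip in enumerate(code_k):
        for u, g_val in enumerate(g):
            t = p * n_c + (u - g_origin)   # g[u] holds the pulse sample at t = u - g_origin
            s[t] = s.get(t, 0) + g_val * chip
    return s

def c_direct(code_l, code_k, g, g_origin, n_c, tau):
    """Equation (1): dot product of user l's code with user k's waveform."""
    s_k = signature(code_k, g, g_origin, n_c)
    n_l = len(code_l)
    total = sum(s_k.get(n * n_c + tau, 0) * code_l[n].conjugate() for n in range(n_l))
    return total / (2 * n_l)

def gamma(code_l, code_k, m):
    """Code correlation (1/2N_l) * sum_n conj(c_l[n]) * c_k[n - m]."""
    n_l = len(code_l)
    total = 0
    for n in range(n_l):
        if 0 <= n - m < len(code_k):
            total += code_l[n].conjugate() * code_k[n - m]
    return total / (2 * n_l)

def c_from_gamma(code_l, code_k, g, g_origin, n_c, tau):
    """Last line of Equation (4): sum_m g[m*N_c + tau] * Gamma_lk[m]."""
    total = 0
    for m in range(-(len(code_k) - 1), len(code_l)):
        u = m * n_c + tau + g_origin
        if 0 <= u < len(g):
            total += g[u] * gamma(code_l, code_k, m)
    return total

random.seed(1)
CHIPS = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
c_l = [random.choice(CHIPS) for _ in range(8)]
c_k = [random.choice(CHIPS) for _ in range(8)]
L_G, N_C = 12, 4   # toy values, not the L_g = 48 of the memo
# Triangular stand-in pulse defined for t = -L_G/2 + 1 .. L_G/2, with g[0] = 1.
g = [max(0.0, 1.0 - abs(t) / 6.0) for t in range(-L_G // 2 + 1, L_G // 2 + 1)]
G_ORIGIN = L_G // 2 - 1   # list index of the t = 0 sample

max_err = max(abs(c_direct(c_l, c_k, g, G_ORIGIN, N_C, tau)
                  - c_from_gamma(c_l, c_k, g, G_ORIGIN, N_C, tau))
              for tau in range(-12, 13))
```

The two routes agree to rounding error for every tested offset, which is the property that allows the correlations to be precomputed once and reused whenever a multipath delay changes.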



Since the pulse shape vector g[n] is of length L_g there are at most 2L_g/N_c = 24 real macs 
to be performed to calculate each element C_lkqq'[m']. (The factor of 2 is because the code 
correlations Γ_lk[m] are complex.) Given τ it is important to be able to efficiently calculate 
the range of values m for which g[mN_c + τ] is non-zero. The minimum value of m is given 
by m_min1 N_c + τ = -L_g/2 + 1. Now τ is given by τ = m'NN_c + τ_lq - τ_kq'. If each τ value is 
decomposed as τ_lq = n_lq N_c + p_lq, then m_min1 = ceil[ (-τ - L_g/2 + 1)/N_c ] = -m'N - n_lq + n_kq' - 
L_g/(2N_c) + ceil[ (p_kq' - p_lq + 1)/N_c ]. Now ceil[ (p_kq' - p_lq + 1)/N_c ] will be either 0 or 1. It is 
convenient to set this to 0. In order that we do not access values outside the allocation for 
g[n] we must set g[n] = 0.0 for n = -L_g/2 : -L_g/2 - (N_c - 1). Note that of the N_c^2 possible 
values for ceil[ (p_kq' - p_lq + 1)/N_c ], all but one are 0. Hence we have 

$$m_{min1} = -m'N - n_{lq} + n_{kq'} - L_g/(2N_c) \qquad (5)$$

Note that L_g must be divisible by 2N_c, and that L_g/(2N_c) should be a system constant. 

The maximum value of m is given by m_max1 N_c + τ = L_g/2. This gives m_max1 = floor[ (-τ + 
L_g/2)/N_c ] = -m'N - n_lq + n_kq' + L_g/(2N_c) + floor[ (p_kq' - p_lq)/N_c ]. Now floor[ (p_kq' - p_lq)/N_c ] will 
be either -1 or 0. It is convenient to set this to 0. In order that we do not access values 
outside the allocation for g[n] we must set g[n] = 0.0 for n = L_g/2 + 1 : L_g/2 + N_c. Note that 
of the N_c^2 possible values for floor[ (p_kq' - p_lq)/N_c ], about half are 0. Hence we have 

$$m_{max1} = -m'N - n_{lq} + n_{kq'} + L_g/(2N_c) \qquad (6)$$



These values are quickly calculable. 
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The r matrix is calculated in the next section for all values m by exploiting the FFT. Notice 
that the calculation of the C-matrix elements requires only a small subset of the Γ matrix 
elements. 

3. Using the FFT to Calculate the Γ-matrix Elements 

In the previous section it was shown that the Γ-matrix elements can be represented as a 
convolution. This fact is here exploited to calculate the Γ-matrix elements using the FFT 
convolution theorem. From Equation (4) the Γ-matrix elements are 

$$\Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n=0}^{N-1} c_l^*[n] \cdot c_k[n-m] \qquad (7)$$

where N = 256. Three streams are related by this equation. In order to apply the 
convolution theorem all three streams must be defined over the same time interval. The 
code streams c_k[n] and c_l^*[n] are non-zero from n = 0:255. These intervals are based on 
the maximum spreading factor. For higher data-rate users the intervals over which the 
streams are non-zero are reduced further. We are concerned here, however, with the 
intervals derived from the highest spreading factor since these will be the largest intervals 
and we wish to define a common interval for all streams. The common interval allows the 
FFTs to be reused for all user interactions. 







Figure 1. Interval for FFT calculation of the Γ-matrix elements, shown 
for the case where N_k = 256 and N_l = 128. 

The range of values m for which Γ_lk[m] is non-zero can be derived from the above 
intervals. The maximum value of m is limited by n - m >= 0, which gives 

$$255 - m_{max} = 0 \;\Rightarrow\; m_{max} = 255 \qquad (8)$$

and the minimum value is limited by n - m <= 255, which gives 

$$0 - m_{min} = 255 \;\Rightarrow\; m_{min} = -255 \qquad (9)$$

To achieve a common interval for all three streams we select the interval m = -M/2 : M/2 - 
1, M = 512. Where necessary the streams are zero-padded to fill up the interval. 
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Now, the DFT and IDFT of the streams are 

$$C_l[r] = \sum_{n=-M/2}^{M/2-1} c_l[n] \, e^{-j2\pi nr/M}, \qquad
c_l[n] = \frac{1}{M} \sum_{r=-M/2}^{M/2-1} C_l[r] \, e^{j2\pi nr/M} \qquad (10)$$

which gives 

$$\begin{aligned}
\Gamma_{lk}[m] &= \frac{1}{2N_l} \sum_{n=-M/2}^{M/2-1} c_l^*[n] \cdot c_k[n-m] \\
&= \frac{1}{2N_l M^2} \sum_{n=-M/2}^{M/2-1} \; \sum_{r=-M/2}^{M/2-1} C_l^*[r] \, e^{-j2\pi nr/M} \sum_{r'=-M/2}^{M/2-1} C_k[r'] \, e^{j2\pi (n-m)r'/M} \\
&= \frac{1}{2N_l M} \sum_{r=-M/2}^{M/2-1} C_k[r] \, C_l^*[r] \, e^{-j2\pi mr/M}
\end{aligned} \qquad (11)$$

Hence Γ_lk[m] can be calculated using the FFTs. Notice that the FFT gives values for all m. 
From the analysis above we know that many of these values will be zero for high data rate 
users. To conserve memory we wish to store only the non-zero values. The values of m 
for which Γ_lk[m] is non-zero can be determined analytically. This subject is treated in the 
next section where the storage and retrieval of the Γ-matrix elements is considered. 
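As a small illustration of Equations (10) and (11), the sketch below computes Γ_lk[m] through the frequency domain and compares it with the direct correlation of Equation (7). It is a toy example under stated assumptions: a naive O(M^2) DFT stands in for the FFT library call, the codes are 8 chips long with M = 16 rather than 256 chips with M = 512, and the function names are hypothetical.

```python
import cmath

def dft(x, sign):
    """Naive DFT over indices -M/2 .. M/2-1; sign = -1 matches Equation (10)."""
    m_len = len(x)
    half = m_len // 2
    out = []
    for r in range(-half, half):
        acc = 0j
        for i, value in enumerate(x):
            n = i - half                      # list index i represents sample n
            acc += value * cmath.exp(sign * 2j * cmath.pi * n * r / m_len)
        out.append(acc)
    return out

def gamma_direct(c_l, c_k, m):
    """Equation (7): (1/2N_l) * sum_n conj(c_l[n]) * c_k[n - m]."""
    total = 0j
    for n in range(len(c_l)):
        if 0 <= n - m < len(c_k):
            total += c_l[n].conjugate() * c_k[n - m]
    return total / (2 * len(c_l))

def gamma_via_dft(c_l, c_k, big_m):
    """Equation (11): returns Gamma_lk[m] for m = -M/2 .. M/2-1 as a list."""
    half = big_m // 2

    def pad(code):
        x = [0j] * big_m
        for n, chip in enumerate(code):       # chips occupy n = 0 .. N-1
            x[n + half] = chip
        return x

    cap_l = dft(pad(c_l), -1)
    cap_k = dft(pad(c_k), -1)
    product = [a.conjugate() * b for a, b in zip(cap_l, cap_k)]
    scale = 2 * len(c_l) * big_m
    return [v / scale for v in dft(product, -1)]

c_l = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 + 1j, 1 - 1j, 1 + 1j]
c_k = [1 - 1j, -1 - 1j, 1 + 1j, 1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, 1 - 1j]
M = 16
fast = gamma_via_dft(c_l, c_k, M)
worst = max(abs(fast[i] - gamma_direct(c_l, c_k, i - M // 2)) for i in range(M))
```

Because M is at least twice the code length, the circular correlation implied by the DFT does not wrap around, so every one of the M outputs equals the linear correlation at that shift.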



4. Storage and Retrieval of Γ-matrix Elements 

In order to efficiently store the Γ-matrix elements we must determine which values are 
non-zero. For high data rate users certain elements c_l[n] are zero, even within the interval 
n = 0:N - 1, N = 256. These zero values reduce the interval over which Γ_lk[m] is non-zero. 
In order to determine the interval for non-zero values consider 

$$\Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n=0}^{N-1} c_l^*[n] \cdot c_k[n-m] \qquad (12)$$

Define index j_l for the lth virtual user such that c_l[n] is non-zero only over the interval 
n = j_l N_l : j_l N_l + N_l - 1. Correspondingly, the vector c_k[n] is non-zero only over the interval 
n = j_k N_k : j_k N_k + N_k - 1. Given these definitions Γ_lk[m] can be rewritten as 

$$\Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n=0}^{N_l-1} c_l^*[n + j_l N_l] \cdot c_k[n + j_l N_l - m] \qquad (13)$$

The minimum value of m for which Γ_lk[m] is non-zero is 

$$m_{min2} = -N_k + 1 - j_k N_k + j_l N_l \qquad (14)$$

and the maximum value of m for which Γ_lk[m] is non-zero is 

$$m_{max2} = N_l - 1 - j_k N_k + j_l N_l \qquad (15)$$

The total number of non-zero elements is then 

$$m_{total} = m_{max2} - m_{min2} + 1 = N_l + N_k - 1 \qquad (16)$$



Table 1 below gives the number of bytes per l,k virtual-user pair based on 2 bytes per 
element - one byte for the real part and one byte for the imaginary part. 





Table 1. Bytes per (l,k) virtual-user pair. 

            N_k = 256     128      64      32      16       8       4 
N_l = 256       1022     766     638     574     542     526     518 
      128        766     510     382     318     286     270     262 
       64        638     382     254     190     158     142     134 
       32        574     318     190     126      94      78      70 
       16        542     286     158      94      62      46      38 
        8        526     270     142      78      46      30      22 
        4        518     262     134      70      38      22      14 

Now we are in a position to determine the memory requirements for the Γ matrix for a 
given number of users at each spreading factor. Let there be K_q virtual users at spreading 
factor N_q = 2^(8-q), q = 0:6, where K_q is the qth element of the vector K. Note that some 
elements of K may be zero. Let Table 1 above be stored in matrix M with elements M_qq'. 
For example, M_00 = 1022, and M_01 = 766. The total memory required by the Γ matrix in 
bytes is then 

$$M_{bytes} = \frac{1}{2} \sum_{q=0}^{6} K_q \left[ M_{qq} + \sum_{q'=0}^{6} M_{qq'} K_{q'} \right] \qquad (17)$$

For example, for 200 virtual users at spreading factor N_0 = 256 we have K_q = 200 δ_q0, 
which gives M_bytes = (1/2) K_0 (K_0 + 1) M_00 = 100(201)(1022) = 20.5 MB. 

For ten 384 Kbps users we have K_q = K_0 δ_q0 + K_6 δ_q6 with K_0 = 10 and K_6 = 640. This gives 
M_bytes = (1/2) K_0 (K_0 + 1) M_00 + K_0 K_6 M_06 + (1/2) K_6 (K_6 + 1) M_66 = 5(11)(1022) + 10(640)(518) + 
320(641)(14) = 6.2 MB. 
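The table entries and the two memory totals above follow directly from Equations (16) and (17). The short Python sketch below regenerates them; the function and variable names are hypothetical, and the only inputs are the spreading factors and user counts quoted in the text.

```python
SF = [256, 128, 64, 32, 16, 8, 4]        # N_q = 2**(8 - q) for q = 0..6

def bytes_per_pair(n_l, n_k):
    """2 bytes (real + imaginary) for each of the N_l + N_k - 1 non-zero shifts."""
    return 2 * (n_l + n_k - 1)

# Matrix M of Table 1, indexed M[q][q'].
M = [[bytes_per_pair(n_l, n_k) for n_k in SF] for n_l in SF]

def gamma_bytes(k):
    """Equation (17): total bytes for user-count vector K (one entry per q)."""
    total = 0
    for q, k_q in enumerate(k):
        total += k_q * (M[q][q] + sum(M[q][qp] * k[qp] for qp in range(len(SF))))
    return total // 2                     # the bracketed sum is always even

voice_only = gamma_bytes([200, 0, 0, 0, 0, 0, 0])     # 200 virtual users at SF 256
ten_384k = gamma_bytes([10, 0, 0, 0, 0, 0, 640])      # 10 control + 640 data channels
```

Here `voice_only` evaluates to 20,542,200 bytes and `ten_384k` to 6,243,090 bytes, matching the 20.5 MB and 6.2 MB figures.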

Now consider addressing, storing and accessing the Γ-matrix data. For each pair (l,k), k 
>= l, we have one complex value Γ_lk[m] for each value of m, where m ranges from m_min2 
to m_max2, and the total number of non-zero elements is m_total = m_max2 - m_min2 + 1. Hence for 
each pair (l,k), k >= l, we have 2m_total time-contiguous bytes. To access the data, create an 
array of structures: 

struct { 
    int m_min2; 
    int m_max2; 
    int m_total; 
    char * Glk; 
} G_info[N_VU_MAX][N_VU_MAX]; 

The C-matrix data is then retrieved using something like: 

m_min2 = G_info[l][k].m_min2 
m_max2 = G_info[l][k].m_max2 
N_g = L_g/N_c 

for m' = 0:1 
    N1 = -m'*N - L_g/(2*N_c)              // per Equation (5) 
    for q = 0:L-1 
        for q' = 0:L-1 
            tau = m'*T + tau_lq - tau_kq' 
            m_min1 = N1 - n_lq + n_kq' 
            m_max1 = m_min1 + N_g 
            m_min = max[ m_min1, m_min2 ] 
            m_max = min[ m_max1, m_max2 ] 
            // m_max >= m_min 
            m_span = m_max - m_min + 1 
            sum1 = 0.0; 
            ptr1 = &G_info[l][k].Glk[m_min] 
            ptr2 = &g[ m_min*N_c + tau ] 
            while m_span > 0 
                sum1 += ( *ptr1++ ) * ( *ptr2++ ) 
                m_span-- 
            end 
            C[m'][l][k][q][q'] = sum1 
        end 
    end 
end 
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5. Estimated Processing Times 

The following processing times are estimated below: 

• Calculate Γ-matrix elements 

• Write Γ-matrix elements to SDRAM 

• Pack Γ-matrix elements in SDRAM 

• Extract Γ-matrix elements/Form C-matrix from SDRAM 

• Write C-matrix elements to L2 cache 

• Pack C-matrix elements in L2 cache 

Processing times are calculated for two cases of interest. The first case is where K = 100 
users (K_v = 200 virtual users) are accessing the system and a voice user is added to the 
system. Not all of these users are active. The control channels are always active, but the 
data channels have activity factor AF = 0.4. The mean number of active virtual users is 
then K + AF*K = 140. The standard deviation is σ = sqrt(K*AF*(1 - AF)) = 4.90. With high 
probability, then, we have K_v < 140 + 3σ < 155 active users. 

The second case is the worst-case scenario. This occurs when a number of voice users 
are accessing the system and a single 384 Kbps data user is added. A single 384 Kbps 
data user adds interference equal to (0.25 + 0.125*100)/(0.25 + 0.400*1) ≈ 20 voice users. 
Hence the number of voice users accessing the system must be reduced to 
approximately K = 100 - 20 = 80 (K_v = 160). The 3σ number of active virtual users is then 
80 + (0.125)80 + 3(3.0) = 99 active virtual users. The reason this scenario is stressful is 
that when a single 384 Kbps data user is added to the system, J + 1 = 64 + 1 = 65 virtual 
users are added to the system. 
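The active-user bounds for both cases come from a mean plus three standard deviations of a binomial count of active data channels. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and the activity factors are simply the values used in the text for each case.

```python
import math

def active_users(n_control, n_data, activity_factor):
    """Return (mean, sigma, mean + 3*sigma) for the number of active virtual users.

    Control channels are always active; each data channel is active
    independently with probability `activity_factor`.
    """
    mean = n_control + activity_factor * n_data
    sigma = math.sqrt(n_data * activity_factor * (1.0 - activity_factor))
    return mean, sigma, mean + 3.0 * sigma

case1 = active_users(100, 100, 0.4)     # K = 100 users, AF = 0.4
case2 = active_users(80, 80, 0.125)     # K = 80 users, factor 0.125 as in the text
```

For the first case this gives a mean of 140, σ ≈ 4.90 and a bound just under 155; for the second it gives a bound of about 98.9, which the text rounds to 99.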



Calculate Γ-matrix elements 

The Γ-matrix elements can be calculated in one of two ways. The first is using the SAL 
zconvx to perform the direct convolution. The second is using the SAL fft_zipx to perform 
the calculation via the FFT. The first method is preferable when the vector lengths are 
small. SAL timings are given in Table 2. These timings are based on a 400 MHz PPC7400 
with 160 MHz, 2 MB L2 cache. The data is assumed resident in L1 cache. The 
performance loss for data resident in L2 cache is not severe. 



Table 2. SAL timings and GFLOPS for the zconvx function. 

M_total    N_min    Timing (us)    GFLOPS 
   1024        4          19.33      1.70 
   1024        8          29.73      2.20 
   1024       16          50.55      2.59 
   1024       32          92.32      2.84 
   1024       64         176.53      2.97 
   1024      128         346.80      3.47 



The time to perform a 512 complex FFT, with in-place calculation (fft_zipx), on a 400 MHz 
PPC7400 with 160MHz, 2MB L2 cache is 10.94 us for data L1 resident. Prior to 
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performing the (final) FFT we must perform a complex vector multiply of length 512. The 
SAL timings for zvmulx are given in Table 3. 



Table 3. SAL timings and GFLOPS for the zvmulx function. 

Length    Location    Timing (us)    GFLOPS 
  1024    L1                 4.46     1.38 
  1024    L2                24.27     0.253 
  1024    DRAM              61.49     0.100 



We will also be interested in the time to move data. Hence the SAL timings for zvmovx are 
given in Table 4. 



Table 4. SAL timings for the zvmovx function. 

Length    Location    Timing (us) 
  1024    L1                 1.20 
  1024    L2                15.34 
  1024    DRAM              30.05 



Figure 2 shows the elements that must be calculated (in gray) when a physical user is 
added to the system. When a physical user is added to the system there are 1 + J virtual 
users added to the system: that is, 1 control channel + J = 256/SF data channels. The 
number K_v represents the number of virtual users that are using the system to begin with. 

Figure 2. Elements that must be calculated (in gray) 
when a physical user is added to the system. 

Hence there are (K_v + 1) elements added due to the control channel, and J(K_v + 1) + J(J + 
1)/2 elements added due to the data channels. The total number of elements added is 
then (J + 1)[K_v + 1 + J/2]. 

FFTs can be used to perform the calculations. The total number of FFTs to 
perform is (J + 1) + (J + 1)[K_v + 1 + J/2]. The first term represents the FFTs to transform 
c_l[n], and the second term represents the (J + 1)[K_v + 1 + J/2] inverse FFTs of 
FFT{c_k[n]}*FFT{c_l^*[n]}. The time to perform the complex 512 FFT is 10.94 us, whereas 
the time to perform the complex vector multiply and the complex 512 FFT is 24.27/2 + 
10.94 = 23.08 us. 

For the first scenario there are K_v = 200 virtual users accessing the system and a voice 
user is added to the system (J = 1). The total time to add the voice user is then (1 + 
1)(10.94 us) + (1 + 1)[200 + 1 + 1/2](23.08 us) = 9.3 ms. 

For the second scenario there are K_v = 160 virtual users accessing the system and a 384 
Kbps data user is added to the system (J = 64). The total time to add the 384 Kbps user is 
then (64 + 1)(10.94 us) + (64 + 1)[160 + 1 + 64/2](23.08 us) = 290 ms. This is 
far too long, and hence for high data-rate users, at least, the Γ-matrix elements must be 
calculated via convolutions. 
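The two FFT-based totals can be reproduced with a few lines of Python. This is only a restatement of the arithmetic above, with a hypothetical function name; the timing constants are the SAL figures quoted in the text.

```python
T_FFT_US = 10.94                      # one 512-point complex FFT, L1 resident
T_MUL_FFT_US = 24.27 / 2 + 10.94      # 512-point vector multiply plus one FFT

def fft_add_time_us(k_v, j):
    """Time to add one physical user (1 control + J data channels) using FFTs only."""
    forward_ffts = j + 1                               # transform each new code
    pair_count = (j + 1) * (k_v + 1 + j / 2)           # new (l,k) pairs
    return forward_ffts * T_FFT_US + pair_count * T_MUL_FFT_US

voice_ms = fft_add_time_us(200, 1) / 1000.0     # first scenario
data_ms = fft_add_time_us(160, 64) / 1000.0     # second scenario
```

The results are about 9.3 ms for the voice user and about 290 ms for the 384 Kbps user, which is why the text turns to direct convolution for the short, high data-rate codes.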

The direct method to calculate the Γ-matrix elements is to use the SAL zconvx function to 
perform the convolution 

$$\begin{aligned}
\Gamma_{lk}[m] &= \frac{1}{2N_l} \sum_{n=0}^{N_l-1} c_l^*[n + j_l N_l] \cdot c_k[n + j_l N_l - m] \\
&= \frac{1}{2N_l} \sum_{n=0}^{N_k-1} c_l^*[n + j_k N_k + m] \cdot c_k[n + j_k N_k]
\end{aligned} \qquad (18)$$

For each value of m there are N_min = min{N_l, N_k} complex macs (cmacs). Each cmac 
requires 8 flops, and there are m_total = N_l + N_k - 1 m-values to calculate. Hence the total 
number of flops is 8N_min(N_l + N_k - 1). For what follows we assume the convolution 
calculation is performed at 1.50 GOPs = 1500 ops/us. The calculation time to perform the 
convolutions is presented in Table 5. 



Table 5. Calculation time (us) to perform the Γ-matrix convolutions. 

            N_k = 256     128      64      32      16       8       4 
N_l = 256     697.69  261.46  108.89   48.98   23.13   11.22    5.53 
      128     261.46  174.08   65.19   27.14   12.20    5.76    2.79 
       64     108.89   65.19   43.35   16.21    6.74    3.03    1.43 
       32      48.98   27.14   16.21   10.75    4.01    1.66    0.75 
       16      23.13   12.20    6.74    4.01    2.65    0.98    0.41 
        8      11.22    5.76    3.03    1.66    0.98    0.64    0.23 
        4       5.53    2.79    1.43    0.75    0.41    0.23    0.15 

The shaded cells (entries below 23.08 us) indicate times faster than the FFT time. Equation 17 gives the 
size of the Γ-matrix in bytes. Similarly, the total time to calculate the Γ-matrix is 

$$T_\Gamma = \frac{1}{2} \sum_{q=0}^{6} K_q \left[ T_{qq} + \sum_{q'=0}^{6} T_{qq'} K_{q'} \right] \qquad (19)$$

where T_qq' are the elements in Table 5. Now suppose K' = K + Δ, where Δ_q = J_x δ_qx + J_y δ_qy, 
where x and y are not equal. Then 

$$\begin{aligned}
\Delta T_\Gamma &= T_\Gamma(K') - T_\Gamma(K) \\
&= \frac{1}{2} J_x (J_x + 1) T_{xx} + \frac{1}{2} J_y (J_y + 1) T_{yy} + J_x J_y T_{xy} + \sum_{q=0}^{6} K_q (J_x T_{qx} + J_y T_{qy})
\end{aligned} \qquad (20)$$

For the first scenario there are K_v = 200 virtual users accessing the system and a voice 
user is added to the system (J = 1). Hence we have K_q = K_v δ_q0 (SF = 256), K_v = 200, J_x = J + 1 
= 2 and J_y = 0. The total time is then 

(1/2) J_x(J_x + 1) T_00 + J_x K_v T_00 = (0.5)(2)(3)(0.70 ms) + (2)(200)(0.70 ms) ≈ 282 ms 

This is far too long, and hence for voice users, at least, the Γ-matrix elements 
must be calculated via FFTs. 

For the second scenario there are K_v = 160 virtual users accessing the system and a 384 
Kbps data user is added to the system (J = 64). Hence we have K_q = K_v δ_q0 (SF = 256), K_v 
= 160, J_x = 1 (control) and J_y = J = 64 (data). The total time is then 

(K_v + 1) T_00 + J(K_v + 1) T_06 + (J + 1)(J/2) T_66 

= (161)(697.7 us) + (64)(161)(5.53 us) + (65)(32)(0.15 us) 

= 112.33 ms + 56.98 ms + 0.31 ms = 169.62 ms 

Since T_00 = 697.7 us is so large, these calculations should be performed using the FFT, 
which costs 23.08 us per convolution. We also have 1 FFT to compute FFT{c_l[n]} for 
the single control channel. This costs an additional 10.94 us. The total time, then, to add 
the 384 Kbps user is 

10.94 us + (161)(23.08 us) + (64)(161)(5.53 us) + (65)(32)(0.15 us) = 61.02 ms 
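The incremental-time expression and the mixed FFT/convolution strategy for the second scenario can be sketched as follows. This is an illustration under stated assumptions: the per-pair times are recomputed from the flop formula rather than read from the rounded table, a pair is assumed to use whichever method is faster, and all names are hypothetical.

```python
SF = [256, 128, 64, 32, 16, 8, 4]     # spreading factor N_q for q = 0..6
T_FFT_PATH_US = 23.08                 # vector multiply plus 512-point FFT
T_FFT_US = 10.94                      # one forward 512-point FFT

def conv_time_us(q, qp):
    """Direct-convolution time for one pair at spreading factors N_q, N_q'."""
    n_l, n_k = SF[q], SF[qp]
    return 8 * min(n_l, n_k) * (n_l + n_k - 1) / 1500.0

def hybrid_time_us(q, qp):
    """Assumed policy: use the FFT path whenever direct convolution is slower."""
    return min(conv_time_us(q, qp), T_FFT_PATH_US)

def delta_time_us(k, j_x, x, j_y, y, t):
    """Incremental time when J_x users at index x and J_y users at index y are added."""
    total = 0.5 * j_x * (j_x + 1) * t(x, x)
    total += 0.5 * j_y * (j_y + 1) * t(y, y)
    total += j_x * j_y * t(x, y)
    total += sum(k_q * (j_x * t(q, x) + j_y * t(q, y)) for q, k_q in enumerate(k))
    return total

k = [160, 0, 0, 0, 0, 0, 0]           # 160 virtual users at SF 256
all_conv_ms = delta_time_us(k, 1, 0, 64, 6, conv_time_us) / 1000.0
hybrid_ms = (T_FFT_US + delta_time_us(k, 1, 0, 64, 6, hybrid_time_us)) / 1000.0
```

With unrounded table entries the all-convolution total is about 169.6 ms and the hybrid total about 61.0 ms, in line with the figures above.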
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Write to r-matrix elements to SDRAM 

The numbers in Table 1 represent the 2m_total bytes per Γ-matrix element. Recall that the 
size of the Γ-matrix in bytes from Equation 17 is 

$$M_{bytes} = \frac{1}{2} \sum_{q=0}^{6} K_q \left[ M_{qq} + \sum_{q'=0}^{6} M_{qq'} K_{q'} \right] = \frac{1}{2} \left[ K^T \mathrm{diag}(M) + K^T M K \right] \qquad (21)$$

Now suppose K' = K + Δ, where Δ_q = J_x δ_qx + J_y δ_qy, where x and y are not equal. Then 

$$\begin{aligned}
\Delta M_b &= M_b(K') - M_b(K) \\
&= \frac{1}{2} J_x (J_x + 1) M_{xx} + \frac{1}{2} J_y (J_y + 1) M_{yy} + J_x J_y M_{xy} + \sum_{q=0}^{6} K_q (J_x M_{qx} + J_y M_{qy})
\end{aligned} \qquad (22)$$

Consider the first scenario where K_q = 200 δ_q0 (SF = 256) and a single voice user is 
added to the system: J_x = 2 (data plus control), and J_y = 0. The total number of bytes is 
then 0.5(2)(3)(1022) + 200(2)(1022) = 0.412 MB. The SDRAM write speed is 133 MHz * 8 
bytes * 0.5 = 532 MB/s. The time to write to SDRAM is then 0.774 ms. 

Now for the second scenario K_q = 160 δ_q0 (SF = 256), and a single 384 Kbps (SF = 4) 
user is added to the system: J_x = 1 (control) and J_y = 64 (data). The total number of bytes 
is then 0.5(1)(2)(1022) + 0.5(64)(65)(14) + 160{1(1022) + 64(518)} = 5.498 MB. The 
SDRAM write speed is 133 MHz * 8 bytes * 0.5 = 532 MB/s. The time to write to SDRAM is 
then 10.33 ms. 

Pack Γ-matrix elements in SDRAM 

The maximum total size of the Γ-matrix is 20.5 MB. Suppose that in order to pack the 
matrix every element must be moved. This is the worst case. The SDRAM speed is 
133 MHz * 8 bytes * 0.5 = 532 MB/s. The move time is then 2(20.5 MB)/(532 MB/s) = 77.1 
ms. If the Γ-matrix is divided over three processors this time is reduced by a factor of 3. 
The packing can be done incrementally, so there is no strict time limit. 
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Extract T-matrix elements/Form C-matrix from SDRAM 
Recall that the C-matrix data is retrieved using something like: 

m_min2 = G_info[l][k].m_min2 
m_max2 = G_info[l][k].m_max2 
N_g = L_g/N_c 

for m' = 0:1 
    N1 = -m'*N - L_g/(2*N_c)              // per Equation (5) 
    for q = 0:L-1 
        for q' = 0:L-1 
            tau = m'*T + tau_lq - tau_kq' 
            m_min1 = N1 - n_lq + n_kq' 
            m_max1 = m_min1 + N_g 
            m_min = max[ m_min1, m_min2 ] 
            m_max = min[ m_max1, m_max2 ] 
            if m_max >= m_min 
                m_span = m_max - m_min + 1 
                sum1 = 0.0; 
                ptr1 = &G_info[l][k].Glk[m_min] 
                ptr2 = &g[ m_min*N_c + tau ] 
                while m_span > 0 
                    sum1 += ( *ptr1++ ) * ( *ptr2++ ) 
                    m_span-- 
                end 
                C[m'][l][k][q][q'] = sum1 
            end 
        end 
    end 
end 



Time to extract elements when a new user is added to the system 

We calculated above the time to calculate the Γ-matrix elements when a new user is 
added to the system. Here we consider the time to extract the corresponding C-matrix 
elements. 

Notice that Glk[m] are accessed from SDRAM. Values will almost certainly not be in either 
L1 or L2 cache. For a given (l,k) pair, however, the spread in τ will for most cases be less 
than 8 us (i.e., for a 4 us delay spread), which equates to (8 us)(4 chips/us)(2 bytes/chip) = 
64 bytes, or 2 cache lines. Since data must be read in for two values of m', a total of 4 
cache lines must be read. This will require 16 clocks, or about 16/133 = 0.12 us. 
However, measured results for zvmovx indicate that accesses to SDRAM are performed 
at about 50% efficiency, so that the required time is about 0.24 us. 

Now suppose, for example, user l = x is added to the system. We must fetch the elements 
C[m'][x][k][q][q'] for all m', k, q and q'. As indicated above, all the m', q and q' values will 
be contained typically in 4 cache lines. Hence if there are K_v virtual users we must read in 
4K_v cache lines, or 32K_v clocks, where we have doubled the clocks to account for the 50% 
efficiency. In general J + 1 virtual users are added to the system at a time. This will 
require 32K_v(J + 1) clocks. 

For the first case where we have 155 active virtual users and a new voice user is added 
to the system, the time required to read in the C-matrix elements will be 32(155)(1 + 1) 
clocks/(133 clocks/us) = 74.6 us. The industry standard hold time t_h for a voice call is 140 
s. The average rate λ of users added to the system can be determined from λ t_h = K, 
where K is the average number of users using the system. For K = 100 users we have λ = 
100/140 s = 1 user added per 1.4 s. 

For the case where we have 99 active virtual users and a 384 Kbps user is added to the 
system the time required to read in the C-matrix elements will be 32(99)(64 + 1) 
clocks/(133 clocks/us) = 1.55 ms. However, data users presumably will be added to the 
system less frequently than voice users. 



Time to extract elements when τ_lq changes 

Now suppose, for example, that lag q = y of user l = x changes. Then we must fetch the elements 
C[m'][x][k][y][q'] for all m', k and q'. All the q' values will be contained typically in 1 cache 
line. Hence we must read in 2(K_v)(1) = 2K_v cache lines, or 16K_v clocks, where we have 
doubled the clocks to account for the 50% efficiency. In general, when a lag changes 
there are J + 1 virtual users for which the C-matrix elements must be updated. This will 
require 16K_v(J + 1) clocks. 

For the first case where we have 155 active virtual users and a voice user's profile (one 
lag) changes, the time required to read in the C-matrix elements will be 16(155)(1 + 1) 
clocks/(133 clocks/us) = 37.3 us. Recall that for high mobility users such changes should 
occur at a rate of about 1 per 100 ms per physical user. This equates to about once per 
1.33 ms processing interval if there are 100 physical users, so that approximately 37.3 us 
will be required every 1.33 ms. 

For the case where we have 99 virtual users and a 384 Kbps data user's profile (one lag) 
changes, the time required to read in the C-matrix elements will be 16(99)(64 + 1) 
clocks/(133 clocks/us) = 0.774 ms. However, data users will have lower mobility and hence 
such changes should occur infrequently. 
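The four extraction times above share one formula: clocks per existing virtual user, times K_v, times the J + 1 affected virtual users, divided by the 133 MHz SDRAM clock. A minimal Python sketch with hypothetical names:

```python
SDRAM_CLOCKS_PER_US = 133.0
CLOCKS_NEW_USER = 32       # 4 cache lines per existing user, doubled for 50% efficiency
CLOCKS_LAG_CHANGE = 16     # 2 cache lines per existing user, doubled for 50% efficiency

def extract_time_us(k_v, j, clocks_per_user):
    """Time to read C-matrix source data for J + 1 affected virtual users."""
    return clocks_per_user * k_v * (j + 1) / SDRAM_CLOCKS_PER_US

voice_new = extract_time_us(155, 1, CLOCKS_NEW_USER)
data_new = extract_time_us(99, 64, CLOCKS_NEW_USER)
voice_lag = extract_time_us(155, 1, CLOCKS_LAG_CHANGE)
data_lag = extract_time_us(99, 64, CLOCKS_LAG_CHANGE)
```

These evaluate to roughly 74.6 us, 1548 us, 37.3 us and 774 us respectively.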



Write C-matrix elements to L2 cache 

Time to write elements when a new user is added to the system 

Consider again the case where user l = x is added to the system. We must write elements 
C[m'][x][k][q][q'] for all m', k, q and q'. If there are K_v active virtual users we must write 
4K_v L^2 bytes, where we have doubled the bytes since the elements are complex. In general 
J + 1 virtual users are added to the system at a time. This will require 4K_v L^2 (J + 1) bytes to 
be written to L2 cache. 

For the first case where we have 155 active virtual users and a new voice user is added 
to the system, the time required to write the C-matrix elements will be 4(155)(16)(1 + 1) 
bytes/(2128 bytes/us) = 9.3 us. 
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For the second case where we have 99 active virtual users and a 384 Kbps user is added 
to the system, the time required to write the C-matrix elements will be 4(99)(16)(64 + 1) 
bytes/(2128 bytes/us) = 193.5 us. Recall, however, that data users presumably will be 
added to the system more infrequently than voice users. 



Time to write elements when τ_lq changes 

Now suppose, for example, that lag q = y of user l = x changes. We must write elements 
C[m'][x][k][y][q'] for all m', k and q'. If there are K_v active virtual users we must write 4K_v L 
bytes, where we have doubled the bytes since the elements are complex. In general, when a 
lag changes there are J + 1 virtual users for which the C-matrix elements must be updated. 
This will require 4K_v L (J + 1) bytes to be written to L2 cache. 

For the first case where we have 155 active virtual users and a voice user's profile (one 
lag) changes, the time required to write the C-matrix elements will be 4(155)(4)(1 + 1) 
bytes/(2128 bytes/us) = 2.33 us. 

For the second case where we have 99 active virtual users and a 384 Kbps data user's 
profile (one lag) changes, the time required to write the C-matrix elements will be 
4(99)(4)(64 + 1) bytes/(2128 bytes/us) = 48.4 us. However, data users will have lower 
mobility and hence such changes should occur infrequently. 



Pack C-matrix elements in L2 cache 

The C-matrix elements will need to be packed in memory every time a new user is added 
to or deleted from the system and every time a new user becomes active or inactive. The 
size of the C-matrix is 2(3/2)(K_v L)^2 = 3(K_v L)^2 bytes; however, divided over three 
processors this becomes (K_v L)^2 bytes per processor. Assume that the entire matrix must 
be moved. The move is within L2 cache. Hence the total move time is 2(K_v L)^2 bytes/(2128 
bytes/us), where the factor of 2 accounts for read and write. 

For the first case where we have 155 active virtual users the time required to move the 
C-matrix elements will be 2(155*4)^2 bytes/(2128 bytes/us) = 0.361 ms. 

For the second case where we have 99 active virtual users the time required to move the 
C-matrix elements will be 2(99*4)^2 bytes/(2128 bytes/us) = 0.147 ms. 

These events will occur typically once every 10 ms, that is, once per frame. 
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6. Summary and Conclusions 

In summary, we have determined 

• The Γ-matrix will require approximately 20.5 MB of SDRAM 

• To efficiently calculate the Γ-matrix elements will require both direct convolution 
and FFT calculations 

• To pack the Γ matrix in SDRAM will require approximately 77.1 ms 



The following processing times are estimated: 

Estimated Processing Times              Case 1                 Case 2 
                                        (voice user added)     (384 Kbps user added) 
Calculate Γ-matrix elements             9.3 ms                 61.0 ms 
Write Γ-matrix elements to SDRAM        0.77 ms                10.3 ms 
Extract C-matrix elements when 
    New user added                      75 us                  1.6 ms 
    Multipath profile changes           37 us                  0.77 ms 
Write C-matrix elements to L2 when 
    New user added                      9.3 us                 194 us 
    Multipath profile changes           2.3 us                 48 us 
Pack C-matrix elements in L2 cache      361 us                 147 us 

These times are based on a single but devoted G4 allocated to perform the calculations. 
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The C-matrix elements can be represented in terms of the underlying code 
correlations using 



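$$\begin{aligned}
C_{lkqq'}[m'] &= \frac{1}{2N_l} \sum_n s_k[nN_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_l^*[n] \\
&= \frac{1}{2N_l} \sum_n \sum_p g[(n-p)N_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_k[p] \cdot c_l^*[n] \\
&= \sum_m g[mN_c + \tau] \, \frac{1}{2N_l} \sum_n c_l^*[n] \cdot c_k[n-m] \\
&= \sum_m g[mN_c + \tau] \cdot \Gamma_{lk}[m]
\end{aligned} \qquad (1)$$

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_n c_l^*[n] \cdot c_k[n-m]$$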

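The Γ-matrix represents the correlation between the complex user codes. The 
complex code for user l is assumed to be infinite in length, but with only N_l 
non-zero values. The non-zero values are constrained to be ±1 ± j. The Γ-matrix 
represented in terms of the real and imaginary parts of the complex user codes 
becomes 

$$\begin{aligned}
\Gamma_{lk}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^*[n] \cdot c_k[n-m] \\
&= \frac{1}{2N_l} \sum_n \{ c_l^R[n] - j c_l^I[n] \} \cdot \{ c_k^R[n-m] + j c_k^I[n-m] \} \\
&= \frac{1}{2N_l} \sum_n \{ c_l^R[n] \cdot c_k^R[n-m] + c_l^I[n] \cdot c_k^I[n-m] \\
&\qquad\qquad + j c_l^R[n] \cdot c_k^I[n-m] - j c_l^I[n] \cdot c_k^R[n-m] \} \\
&= \Gamma_{lk}^{RR}[m] + \Gamma_{lk}^{II}[m] + j \{ \Gamma_{lk}^{RI}[m] - \Gamma_{lk}^{IR}[m] \}
\end{aligned} \qquad (2)$$

where 

$$\begin{aligned}
\Gamma_{lk}^{RR}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^R[n] \cdot c_k^R[n-m] \\
\Gamma_{lk}^{II}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^I[n] \cdot c_k^I[n-m] \\
\Gamma_{lk}^{RI}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^R[n] \cdot c_k^I[n-m] \\
\Gamma_{lk}^{IR}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^I[n] \cdot c_k^R[n-m]
\end{aligned} \qquad (3)$$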

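Consider any one of the above real correlations, denoted

$$
\Gamma_{lk}^{XY}[m] \equiv \frac{1}{2N_l} \sum_n c_l^X[n]\, c_k^Y[n-m]
\tag{4}
$$

where X and Y can be either R or I. Since the elements of the codes are now
constrained to be ±1 or 0, we can define

$$
c_l^X[n] \equiv \big(1 - 2\gamma_l^X[n]\big) \cdot m_l^X[n]
\tag{5}
$$

where $\gamma_l^X[n]$ and $m_l^X[n]$ are both either zero or one. The sequence
$m_l^X[n]$ is a mask used to account for values of $c_l^X[n]$ that are zero. With
these definitions Equation (4) becomes

$$
\begin{aligned}
\Gamma_{lk}^{XY}[m] &= \frac{1}{2N_l} \sum_n \big(1 - 2\gamma_l^X[n]\big)\big(1 - 2\gamma_k^Y[n-m]\big) \cdot m_l^X[n] \cdot m_k^Y[n-m] \\
&= \frac{1}{2N_l} \sum_n \big\{ 1 - 2\big(\gamma_l^X[n] \oplus \gamma_k^Y[n-m]\big) \big\}\, m_l^X[n] \cdot m_k^Y[n-m] \\
&= \frac{1}{2N_l} \Big\{ \sum_n m_l^X[n]\, m_k^Y[n-m] - 2 \sum_n \big(\gamma_l^X[n] \oplus \gamma_k^Y[n-m]\big) \cdot m_l^X[n]\, m_k^Y[n-m] \Big\} \\
&= \frac{1}{2N_l} \big\{ M_{lk}^{XY}[m] - 2 N_{lk}^{XY}[m] \big\}
\end{aligned}
\tag{6}
$$

with

$$
\begin{aligned}
M_{lk}^{XY}[m] &\equiv \sum_n m_l^X[n] \cdot m_k^Y[n-m] \\
N_{lk}^{XY}[m] &\equiv \sum_n \big(\gamma_l^X[n] \oplus \gamma_k^Y[n-m]\big) \cdot m_l^X[n]\, m_k^Y[n-m]
\end{aligned}
$$

where ⊕ indicates modulo-2 addition (or logical XOR).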
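
The equivalence between the direct correlation of Equation (4) and the mask/XOR form of Equation (6) can be checked in software. The following is a minimal sketch only: the names `corr_direct`, `corr_boolean` and `CODE_LEN` are illustrative assumptions, the code length is shortened from 256 chips, and the common factor 1/(2N_l) is omitted from both forms.

```c
#include <stdint.h>

#define CODE_LEN 16   /* illustrative length (assumption); the text loads 256 chips */

/* Direct form: sum over n of c_l[n] * c_k[n-m], code values in {-1, 0, +1}.
 * Samples outside 0..CODE_LEN-1 are treated as zero. */
static int corr_direct(const int8_t *cl, const int8_t *ck, int m)
{
    int sum = 0;
    for (int n = 0; n < CODE_LEN; n++) {
        int idx = n - m;
        if (idx >= 0 && idx < CODE_LEN)
            sum += cl[n] * ck[idx];
    }
    return sum;
}

/* Boolean form: M - 2N, where M counts positions at which both masks
 * are set and N counts those positions at which the sign bits differ. */
static int corr_boolean(const int8_t *cl, const int8_t *ck, int m)
{
    int M = 0, N = 0;
    for (int n = 0; n < CODE_LEN; n++) {
        int idx = n - m;
        if (idx < 0 || idx >= CODE_LEN)
            continue;
        unsigned ml = (cl[n] != 0);      /* mask bit for user l   */
        unsigned mk = (ck[idx] != 0);    /* mask bit for user k   */
        unsigned gl = (cl[n] < 0);       /* gamma = 1 encodes -1  */
        unsigned gk = (ck[idx] < 0);
        unsigned both = ml & mk;
        M += (int)both;
        N += (int)((gl ^ gk) & both);
    }
    return M - 2 * N;
}
```

Because only AND, XOR and counting are needed per shift, the Boolean form maps naturally onto shift registers followed by a summing stage.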

The hardware to perform these operations is shown in Figures 1 - 3. Figure 1 
shows the initial register configuration after loading code and mask sequences. 
The boolean functions are shown in Figure 2, and Figure 3 shows the register 
configuration after a number of shifts. 



[Figure 1 diagram. Labels: "Load mask & code for user l here (256 chips)"; "Load mask & code for user k here (256 chips)"; "Mask l", "Code l"; "Mask k", "Code k"; "Initialize w/ all zeros"; "Boolean operations"; "Sum"; "Output".]

Figure 1. Initial register configuration after loading code and mask sequences.




Figure 2. Boolean functions. 



[Figure 3 diagram. Labels: "Shift mask & code right 1 chip at a time"; "Mask l", "Code l"; "Mask k", "Code k"; "Load zeros in from left"; "Perform a total of 512 shifts, shifting mask k and code k out of registers at right"; "Sum"; "Output".]



Figure 3. Register configuration after a number of shifts. 



The above hardware calculates the functions $M_{lk}^{XY}[m]$ and $N_{lk}^{XY}[m]$. The
remaining calculations to form $\Gamma_{lk}^{XY}[m]$ and subsequently $\Gamma_{lk}[m]$ can be
performed in software. Note that the four functions $\Gamma_{lk}^{XY}[m]$ corresponding to
X, Y = R, I, which are components of $\Gamma_{lk}[m]$, can be calculated in parallel. For
$K_v = 200$ virtual users, and assuming that 10% of all $(l, k)$ pairs must be calculated
in 2 ms, real-time operation requires calculating $0.10(200)^2 = 4000$ $\Gamma_{lk}[m]$
elements (all shifts) in 2 ms, or about 2M elements (all shifts) per second. For
$K_v = 128$ virtual users the requirement drops to 0.8192M elements (all shifts) per second.

In what has been presented, the $\Gamma_{lk}[m]$ elements are calculated for all 512 shifts.
Not all of these shifts are needed, so it is possible to reduce the number of
calculations per $\Gamma_{lk}[m]$ element. The cost is increased design complexity.
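
The throughput figures quoted above follow from a single expression, which can be sketched as below. The helper name `required_rate` and its parameters are illustrative assumptions, not part of the listings that follow.

```c
/* Required rate of Gamma_lk[m] elements (all shifts) per second,
 * assuming a given fraction of the kv * kv user pairs must be
 * recomputed within each update interval. */
static double required_rate(int kv, double fraction, double interval_s)
{
    return fraction * (double)kv * (double)kv / interval_s;
}
```

With a fraction of 0.10 and an interval of 2 ms this gives roughly 2.0e6 elements per second for 200 virtual users and 819,200 for 128 virtual users, matching the values in the text.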
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Makefile 



.SUFFIXES: .a .c .mac .o .S

ARCH = ppc7400
MUDLIB = mudlib.a

###CFLAGS = -Ot -t ${ARCH} -I. -DCOMPILE_C
CFLAGS = -Ot -t ${ARCH} -I.
ASFLAGS = -t ${ARCH} -DBUILD_MAX -I.

#
# Make object files
#
.c.o:
	ccmc ${CFLAGS} -o $*.o -c $*.c

#
# Make ASM
#
.mac.o:
	rm -f $*.S
	cp $*.mac $*.S
	ccmc ${ASFLAGS} -o $*.o -c $*.S
	rm -f $*.S

OBJS = \
	get_sizes.o \
	get_sizes_v.o \
	reformat_corr.o \
	rmats.o \
	reformat_r.o \
	mpic.o \
	gen_x_row.o \
	gen_r_sums.o \
	gen_r_sums2.o \
	gen_r_matrices.o \
	mtrans32_8bit.o \
	mtriangle_8bit.o \
	dotpr3_8bit.o \
	dotpr6_8bit.o \
	dotpr9_8bit.o \
	sve3_8bit.o \
	fixed_cdotpr.o \
	zdotpr4_vmx.o \
	zdotpr_vmx.o

${MUDLIB}: Makefile ${OBJS}
	armc -c $@ ${OBJS}

#
# Cleanup
#
clean:
	rm -f ${OBJS} *.S ${MUDLIB}

get_sizes.o: mudlib.h get_sizes.c
reformat_corr.o: mudlib.h reformat_corr.c
rmats.o: mudlib.h rmats.c \
	gen_x_row.mac gen_r_sums.mac gen_r_sums2.mac \
	gen_r_matrices.mac
reformat_r.o: mudlib.h reformat_r.c
mpic.o: mudlib.h mpic.c \
	dotpr3_8bit.mac dotpr6_8bit.mac dotpr9_8bit.mac \
	sve3_8bit.mac
dotpr3_8bit.o: dotpr3_8bit.mac salppc.inc
dotpr6_8bit.o: dotpr6_8bit.mac salppc.inc
dotpr9_8bit.o: dotpr9_8bit.mac salppc.inc
sve3_8bit.o: sve3_8bit.mac salppc.inc
fixed_cdotpr.o: zdotpr4_vmx.mac salppc.inc
zdotpr4_vmx.o: zdotpr4_vmx.mac zdotpr4_vmx.k salppc.inc


#include "mudlib.h"

#define DO_CALC_STATS    0
#define DO_TRUNCATE      1
#define DO_SATURATE      1
#define DO_SQUELCH       0

#define SQUELCH_THRESH   1.0
#define TRUNCATE_BIAS    0.0

#if DO_TRUNCATE
#define SATURATE_THRESH  (128.0 + TRUNCATE_BIAS)
#else
#define SATURATE_THRESH  127.5
#endif

#define SATURATE( f ) \
{ \
    if ( (f) >= SATURATE_THRESH ) f = (SATURATE_THRESH - 1.0); \
    else if ( (f) < -SATURATE_THRESH ) f = -SATURATE_THRESH; \
}

#if DO_TRUNCATE
#define BF8_FIX( f ) ( (BF8) (FABS(f) <= TRUNCATE_BIAS) ? 0 : \
                       ( ((f) > 0.0) ? ((f) - TRUNCATE_BIAS) : \
                                       ((f) + TRUNCATE_BIAS) ) )
#define BF8_FIX( f ) ( (BF8)(f) )
#else
#define BF8_FIX( f ) ( (BF8)( ( ((f) < 0.0) && ((f) == (float)((int)(f))) ) ? \
                              ((f) + 1.0) : (f) ) )
#endif
#else
#define BF8_FIX( f ) ( (BF8)( ((f) >= 0.0) ? ((f) + 0.5) : ((f) - 0.5) ) )
#endif

#define UPDATE_MAX( f, max ) \
    if ( FABS( f ) > max ) max = FABS( f );

#define uchar  unsigned char
#define ushort unsigned short
#define ulong  unsigned long

#if DO_CALC_STATS
static float max_R_value;
#endif

void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
);

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
);

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
);

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
);



void mudlib_gen_R (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF8 *corr_0_bf,     /* adjusted for starting physical user */
    COMPLEX_BF8 *corr_1_bf,     /* adjusted for starting physical user */
    uchar *ptov_map,            /* no more than 256 virts. per phys */
    float *bf_scalep,           /* scalar: always a power of 2 */
    float *inv_scalep,          /* start at 0'th physical user */
    float *scalep,              /* start at 0'th physical user */
    char *L1_cachep,            /* temp: 32K bytes, 32-byte aligned */
    BF8 *R0_upper_bf,
    BF8 *R0_lower_bf,
    BF8 *R1_trans_bf,
    BF8 *R1m_bf,
    int tot_phys_users,
    int tot_virt_users,
    int start_phys_user,        /* zero-based starting row (inclusive) */
    int start_virt_user,        /* relative to start_phys_user */
    int end_phys_user,          /* zero-based ending row (inclusive) */
    int end_virt_user           /* relative to end_phys_user */
)
{
    COMPLEX_BF16 *X_bf;
    BF32 *R_sums0, *R_sums1;
    uchar *R0_ptov_map;
    int bump, byte_offset, i, iv, last_virt_user;
    int R0_align, R0_skipped_virt_users, R0_tcols, R0_virt_users, R1_tcols;

#if DO_CALC_STATS
    max_R_value = 0.0;
#endif

    X_bf = (COMPLEX_BF16 *)L1_cachep;

    byte_offset = tot_phys_users * NUM_FINGERS_SQUARED * sizeof(COMPLEX_BF16);
    R_sums0 = (BF32 *)(((ulong)X_bf + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);

    byte_offset = tot_virt_users * sizeof(BF32);
    R_sums1 = (BF32 *)(((ulong)R_sums0 + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);
    R0_ptov_map = (uchar *)(((ulong)R_sums1 + byte_offset +
                             R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK);

    R1_tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    R0_virt_users = 0;
    for ( i = start_phys_user; i < tot_phys_users; i++ ) {
        R0_virt_users += (int)ptov_map[i];
        R0_ptov_map[i] = ptov_map[i];
    }
    R0_ptov_map[start_phys_user] -= start_virt_user;

    R0_skipped_virt_users = tot_virt_users - R0_virt_users + start_virt_user;
    R0_virt_users -= (start_virt_user + 1);

    --inv_scalep;    /* predecrement to allow for common indexing */

    for ( i = start_phys_user; i <= end_phys_user; i++ ) {

        gen_X_row (
            mpath1_bf,
            mpath2_bf,
            X_bf,
            i,
            tot_phys_users
        );

        --R0_ptov_map[i];    /* excludes R0 diagonal */

        last_virt_user = (i < end_phys_user) ? ((int)ptov_map[i] - 1) :
                                               end_virt_user;

        for ( iv = start_virt_user; (iv + 1) <= last_virt_user; iv += 2 ) {

            gen_R_sums2 (
                X_bf + (i * NUM_FINGERS_SQUARED),
                corr_0_bf,
                corr_0_bf + ((R0_virt_users - 1) * NUM_FINGERS_SQUARED),
                R0_ptov_map + i,
                R_sums0 + (R0_skipped_virt_users + 1),
                R_sums1 + (R0_skipped_virt_users + 1),
                tot_phys_users - i
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );

            R0_upper_bf[ R0_align - 1 ] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            R0_tcols = R1_tcols - ((R0_skipped_virt_users + 1) &
                                   ~R_MATRIX_ALIGN_MASK);
            R0_align = ((R0_skipped_virt_users + 1) & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums1 + (R0_skipped_virt_users + 2),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep + (R0_skipped_virt_users + 2),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users - 1
            );

            R0_upper_bf[ R0_align - 1 ] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums2 (
                X_bf,
                corr_1_bf,
                corr_1_bf + (tot_virt_users * NUM_FINGERS_SQUARED),
                ptov_map,
                R_sums0,
                R_sums1,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices (
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            gen_R_matrices (
                R_sums1,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (((2 * R0_virt_users) - 1) * NUM_FINGERS_SQUARED);
            corr_1_bf += ((2 * tot_virt_users) * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 2;
            R0_virt_users -= 2;
            R0_skipped_virt_users += 2;
        }

        if ( iv <= last_virt_user ) {

            bump = R0_ptov_map[i] ? 0 : 1;
            gen_R_sums (
                X_bf + ((i + bump) * NUM_FINGERS_SQUARED),
                corr_0_bf,
                R0_ptov_map + i + bump,
                R_sums0 + (R0_skipped_virt_users + 1),
                tot_phys_users - i - bump
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );

            R0_upper_bf[ R0_align - 1 ] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums (
                X_bf,
                corr_1_bf,
                ptov_map,
                R_sums0,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices (
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (R0_virt_users * NUM_FINGERS_SQUARED);
            corr_1_bf += (tot_virt_users * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 1;
            R0_virt_users -= 1;
            R0_skipped_virt_users += 1;
        }

        start_virt_user = 0;    /* for all subsequent passes */
    }

#if DO_CALC_STATS
    printf( "max R value = %f\n", max_R_value );
    if ( max_R_value > 127.0 )
        printf( "***** OVERFLOW *****\n" );
#endif
}

#if COMPILE_C

void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
)
{
    COMPLEX_BF16 *in_mpath1p, *in_mpath2p;
    COMPLEX_BF16 *out_mpath1p, *out_mpath2p;
    int i, j, q, q1;
    BF32 s1r, s1i, s2r, s2i;
    BF32 a1r, a1i, a2r, a2i;
    BF32 cr, ci;

    out_mpath1p = mpath1_bf + (phys_index * NUM_FINGERS);
    out_mpath2p = mpath2_bf + (phys_index * NUM_FINGERS);

    for ( i = 0; i < tot_phys_users; i++ ) {
        in_mpath1p = mpath1_bf + (i * NUM_FINGERS);    /* 4 complex values */
        in_mpath2p = mpath2_bf + (i * NUM_FINGERS);    /* 4 complex values */

        j = 0;
        for ( q1 = 0; q1 < NUM_FINGERS; q1++ ) {
            s1r = (BF32)out_mpath1p[q1].real;
            s1i = (BF32)out_mpath1p[q1].imag;
            s2r = (BF32)out_mpath2p[q1].real;
            s2i = (BF32)out_mpath2p[q1].imag;
            for ( q = 0; q < NUM_FINGERS; q++ ) {
                a1r = (BF32)in_mpath1p[q].real;
                a1i = (BF32)in_mpath1p[q].imag;
                a2r = (BF32)in_mpath2p[q].real;
                a2i = (BF32)in_mpath2p[q].imag;

                cr = (a1r * s1r) + (a1i * s1i);
                ci = (a1r * s1i) - (a1i * s1r);
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);

                X_bf[i * NUM_FINGERS_SQUARED + j].real = (BF16)(cr >> 16);
                X_bf[i * NUM_FINGERS_SQUARED + j].imag = (BF16)(ci >> 16);
                j++;    /* advance to the next finger pair */
            }
        }
    }
}

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
)
{
    int i, j, k;
    BF32 sum;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum += (BF32)X_bf[k].real * (BF32)corr_bf->real;
                sum += (BF32)X_bf[k].imag * (BF32)corr_bf->imag;
                ++corr_bf;
            }
            *R_sums++ = sum;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
)
{
    int i, j, k;
    BF32 suma, sumb;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            suma = 0;
            sumb = 0;
            for ( k = 0; k < 16; k++ ) {
                suma += (BF32)X_bf[k].real * (BF32)corra_bf->real;
                suma += (BF32)X_bf[k].imag * (BF32)corra_bf->imag;
                sumb += (BF32)X_bf[k].real * (BF32)corrb_bf->real;
                sumb += (BF32)X_bf[k].imag * (BF32)corrb_bf->imag;
                ++corra_bf;
                ++corrb_bf;
            }
            *R_sumsa++ = suma;
            *R_sumsb++ = sumb;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
)
{
    int i;
    float bf_scale, fsum, fsum_scale, inv_scale, scale;

    bf_scale = *bf_scalep;
    inv_scale = *inv_scalep;

    for ( i = 0; i < num_virt_users; i++ ) {
        scale = scalep[i];
        fsum = (float)(R_sums[i]);
        fsum *= bf_scale;
        fsum_scale = fsum * inv_scale;
        fsum_scale *= scale;

#if DO_CALC_STATS
        UPDATE_MAX( fsum_scale, max_R_value )
        UPDATE_MAX( fsum, max_R_value )
#endif

#if DO_SQUELCH
        if ( FABS( fsum_scale ) <= SQUELCH_THRESH ) fsum_scale = 0.0;
        if ( FABS( fsum ) <= SQUELCH_THRESH ) fsum = 0.0;
#endif

#if DO_SATURATE
        SATURATE( fsum_scale )
        SATURATE( fsum )
#endif

        no_scale_row_bf[i] = BF8_FIX( fsum );
        scale_row_bf[i] = BF8_FIX( fsum_scale );
    }
}

#endif /* COMPILE_C */
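
The saturate-then-truncate step at the end of gen_R_matrices can be illustrated in isolation. The sketch below is an assumption-laden stand-in, not the library code: the helper name `sat_trunc8` is hypothetical, `int8_t` stands in for the BF8 type, and the thresholds mirror SATURATE with DO_TRUNCATE set and TRUNCATE_BIAS equal to 0.0, followed by a plain cast.

```c
#include <stdint.h>

/* Clamp a float into the signed 8-bit range, then truncate toward zero.
 * Values at or above +128 become +127; values below -128 become -128. */
static int8_t sat_trunc8(float f)
{
    const float thresh = 128.0f;          /* 128.0 + TRUNCATE_BIAS, bias assumed 0 */

    if (f >= thresh)
        f = thresh - 1.0f;
    else if (f < -thresh)
        f = -thresh;

    return (int8_t)f;                     /* C float-to-integer conversion truncates */
}
```

Clamping before the cast matters because converting an out-of-range float to an 8-bit integer is not well defined in C.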
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2/23/2001 



MC Standard Algorithms -- PPC Macro language Version

File Name:    dotpr3_8bit.mac

Description:  Source code for routine which computes three
              dot products, combining the three sums prior
              to exit.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
0.0         000510    fpl         Created
0.1         000521    fpl         Added num_cached_rows
0.2         000521    fpl         Changed to fixed point
0.3         000605    fpl         Changed to .k file
0.4         000926    jg          Back to .mac and no dsts

#include "salppc.inc"

#define LVX_BT( vT, rA, rB )       LVX( vT, rA, rB )
#define FUNC_ENTRY                 dotpr3_8bit
#define VMSUM( vT, vA, vB, vC )    VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT           6
#define HALF_BLOCK_BIT             0x20
#define QUARTER_BLOCK_BIT          0x10
#define LOOP_BLOCK_SIZE            64
/**
    Input parameters
**/
#define bt1mptr  r3
#define r1ptr    r4
#define r0ptr    r5
#define r1mptr   r6
#define C        r7
#define N        r8
#define hat_tc   r9
/**
    Local loop registers
**/
#define bt0ptr   r10
#define bt1ptr   r11
#define index1   r12
#define index2   r13
#define index3   r0
#define icount   hat_tc
/**
    G4 registers
**/
#define rq10     v0
#define rq11     v1
#define rq12     v2
#define rq13     v3
#define zero     v3

#define rq00     v4
#define rq01     v5
#define rq02     v6
#define rq03     v7

#define rq1m0    v8
#define rq1m1    v9
#define rq1m2    v10
#define rq1m3    v11

#define bt1m0    v12
#define bt1m1    v13
#define bt1m2    v14
#define bt1m3    v15

#define bt10     v16
#define bt11     v17
#define bt12     v18
#define bt13     v19

#define bt00     v20
#define bt01     v21
#define bt02     v22
#define bt03     v23

#define sum0     v24
#define sum1     v25
#define sum2     v26
#define sum3     v27

/**
    Begin code text

    Setup loop registers, test for zero N
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13
USE_THRU_v27( VRSAVE_COND )
/**
    Load up local loop registers
**/
    ADD(bt0ptr, bt1mptr, hat_tc)
    VXOR( sum0, sum0, sum0 )
    ADD(bt1ptr, bt0ptr, hat_tc)
    LI(index1, 16)
    VXOR( sum1, sum1, sum1 )
    LI(index2, 32)
    VXOR( sum2, sum2, sum2 )
    LI(index3, 48)
    VXOR( sum3, sum3, sum3 )
    SRWI_C(icount, N, LOOP_COUNT_SHIFT)    /* 32 sum updates per loop trip */
    BEQ(do_half_block)
/**
    Loop entry code
**/
    LVX( rq10, 0, r1ptr )
    LVX( rq11, r1ptr, index1 )
    LVX( rq12, r1ptr, index2 )
    LVX( rq13, r1ptr, index3 )
    DECR_C(icount)
    LVX_BT( bt1m0, 0, bt1mptr )
    LVX_BT( bt1m1, bt1mptr, index1 )
    ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt1m2, bt1mptr, index2 )
    LVX_BT( bt1m3, bt1mptr, index3 )
    ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
    BR( mid_loop )
/**
    Loop computes three dot products held in 16 parts
**/
LABEL( loop )
/* { */
    LVX( rq10, 0, r1ptr )
    VMSUM( sum0, rq1m0, bt10, sum0 )
    LVX( rq11, r1ptr, index1 )
    VMSUM( sum1, rq1m1, bt11, sum1 )
    LVX( rq12, r1ptr, index2 )
    VMSUM( sum2, rq1m2, bt12, sum2 )
    LVX( rq13, r1ptr, index3 )
    DECR_C(icount)
    LVX_BT( bt1m0, 0, bt1mptr )
    VMSUM( sum3, rq1m3, bt13, sum3 )
    LVX_BT( bt1m1, bt1mptr, index1 )
    ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt1m2, bt1mptr, index2 )
    LVX_BT( bt1m3, bt1mptr, index3 )
    ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)

LABEL( mid_loop )
    LVX( rq00, 0, r0ptr )
    VMSUM( sum0, rq10, bt1m0, sum0 )
    LVX( rq01, r0ptr, index1 )
    VMSUM( sum1, rq11, bt1m1, sum1 )
    LVX( rq02, r0ptr, index2 )
    VMSUM( sum2, rq12, bt1m2, sum2 )
    LVX( rq03, r0ptr, index3 )
    LVX_BT( bt00, 0, bt0ptr )
    VMSUM( sum3, rq13, bt1m3, sum3 )
    LVX_BT( bt01, bt0ptr, index1 )
    ADDI(r0ptr, r0ptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt02, bt0ptr, index2 )
    LVX_BT( bt03, bt0ptr, index3 )
    ADDI(bt0ptr, bt0ptr, LOOP_BLOCK_SIZE)

    LVX( rq1m0, 0, r1mptr )
    VMSUM( sum0, rq00, bt00, sum0 )
    LVX( rq1m1, r1mptr, index1 )
    VMSUM( sum1, rq01, bt01, sum1 )
    LVX( rq1m2, r1mptr, index2 )
    VMSUM( sum2, rq02, bt02, sum2 )
    LVX( rq1m3, r1mptr, index3 )
    LVX_BT( bt10, 0, bt1ptr )
    VMSUM( sum3, rq03, bt03, sum3 )
    LVX_BT( bt11, bt1ptr, index1 )
    ADDI(r1mptr, r1mptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt12, bt1ptr, index2 )
    LVX_BT( bt13, bt1ptr, index3 )
    ADDI(bt1ptr, bt1ptr, LOOP_BLOCK_SIZE)
/* } */
    BNE( loop )
/**
    Loop exit code
**/
    VMSUM( sum0, rq1m0, bt10, sum0 )
    VMSUM( sum1, rq1m1, bt11, sum1 )
    VMSUM( sum2, rq1m2, bt12, sum2 )
    VMSUM( sum3, rq1m3, bt13, sum3 )

/**
    Remainders
**/
LABEL(do_half_block)
    ANDI_C( icount, N, HALF_BLOCK_BIT )
    BEQ(do_quarter_block)
    LVX( rq10, 0, r1ptr )
    LVX( rq11, r1ptr, index1 )
    LVX_BT( bt1m0, 0, bt1mptr )
    LVX_BT( bt1m1, bt1mptr, index1 )
    ADDI(r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1))
    ADDI(bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum0, rq10, bt1m0, sum0 )
    VMSUM( sum1, rq11, bt1m1, sum1 )

    LVX( rq00, 0, r0ptr )
    LVX( rq01, r0ptr, index1 )
    LVX_BT( bt00, 0, bt0ptr )
    LVX_BT( bt01, bt0ptr, index1 )
    ADDI(r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1))
    ADDI(bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum0, rq00, bt00, sum0 )
    VMSUM( sum1, rq01, bt01, sum1 )

    LVX( rq1m0, 0, r1mptr )
    LVX( rq1m1, r1mptr, index1 )
    LVX_BT( bt10, 0, bt1ptr )
    LVX_BT( bt11, bt1ptr, index1 )
    ADDI(r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1))
    ADDI(bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum0, rq1m0, bt10, sum0 )
    VMSUM( sum1, rq1m1, bt11, sum1 )

LABEL(do_quarter_block)
    ANDI_C( icount, N, QUARTER_BLOCK_BIT )
    BEQ(combine)
    LVX( rq10, 0, r1ptr )
    LVX_BT( bt1m0, 0, bt1mptr )
    VMSUM( sum0, rq10, bt1m0, sum0 )

    LVX( rq00, 0, r0ptr )
    LVX_BT( bt00, 0, bt0ptr )
    VMSUM( sum0, rq00, bt00, sum0 )

    LVX( rq1m0, 0, r1mptr )
    LVX_BT( bt10, 0, bt1ptr )
    VMSUM( sum0, rq1m0, bt10, sum0 )

/**
    Combine sums and return
**/
LABEL(combine)
    VXOR( zero, zero, zero )
    VADDSWS( sum0, sum0, sum1 )    /* s00 s01 s02 s03 */
    VADDSWS( sum2, sum2, sum3 )    /* s22 s21 s22 s23 */
    VADDSWS( sum0, sum0, sum2 )    /* s00 s01 s02 s03 */
    VSUMSWS( sum0, sum0, zero )    /* xxx xxx xxx s00 */
    VSPLTW( sum0, sum0, 3 )        /* s00 s00 s00 s00 */
    STVEWX( sum0, 0, C )
/**
    Return
**/
LABEL( ret )
FREE_THRU_v27( VRSAVE_COND )
REST_r13
RETURN
FUNC_EPILOG
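
For readers less familiar with AltiVec scheduling, the arithmetic performed by dotpr3_8bit can be summarized by a scalar reference model. This is a sketch under stated assumptions rather than the routine's actual interface: the function name `dotpr3_8bit_model`, its argument order, and the parameter name `stride` (standing in for hat_tc) are hypothetical. It relies on VMSUM mapping to VMSUMMBM, which multiplies signed bytes by unsigned bytes, and it ignores the saturating behaviour of the final VADDSWS/VSUMSWS combine.

```c
#include <stdint.h>

/* Scalar model: one output equal to
 *   (R1 . Bt1m) + (R0 . Bt0) + (R1m . Bt1)
 * where Bt0 and Bt1 follow Bt1m at successive offsets of `stride` bytes.
 * R values are signed 8-bit; Bt values are treated as unsigned 8-bit. */
static int32_t dotpr3_8bit_model(const uint8_t *bt1m,
                                 const int8_t *r1,
                                 const int8_t *r0,
                                 const int8_t *r1m,
                                 int n,
                                 int stride)
{
    const uint8_t *bt0 = bt1m + stride;
    const uint8_t *bt1 = bt0 + stride;
    int32_t sum = 0;

    for (int i = 0; i < n; i++) {
        sum += (int32_t)r1[i]  * (int32_t)bt1m[i];
        sum += (int32_t)r0[i]  * (int32_t)bt0[i];
        sum += (int32_t)r1m[i] * (int32_t)bt1[i];
    }
    return sum;
}
```

The macro version processes 64 bytes per loop trip and then handles 32-byte and 16-byte remainders, so it effectively expects n to be a multiple of 16; the scalar model has no such restriction.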
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MC Standard Algorithms PPC Macro language Version 



File Name:    dotpr6_8bit.mac

Description:  Source code for routine which computes six
              dot products, combining the six sums
              into two outputs prior to exit.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
0.0         000510    fpl         Created
0.1         000521    fpl         Changed to fixed point
0.2         000521    fpl         Added num_cached_rows
0.3         000605    fpl         Changed to .k file
0.4         000926    jg          Back to .mac and no dsts



#include "salppc.inc"

#define LVX_BT( vT, rA, rB )       LVX( vT, rA, rB )
#define FUNC_ENTRY                 dotpr6_8bit
#define VMSUM( vT, vA, vB, vC )    VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT           6
#define HALF_BLOCK_BIT             0x20
#define QUARTER_BLOCK_BIT          0x10
#define LOOP_BLOCK_SIZE            64
/**
    Input parameters
**/
#define bt1mptr  r3
#define r1ptr    r4
#define r0ptr    r5
#define r1mptr   r6
#define C        r7
#define N        r8
#define hat_tc   r9
/**
    Local loop registers
**/
#define bt0ptr   r10
#define bt1ptr   r11
#define bt2ptr   r12
#define index1   r13
#define index2   r14
#define index3   r0
#define icount   hat_tc
/**
    G4 registers
**/
#define rq10     v0
#define rq11     v1
#define rq12     v2
#define rq13     v3
#define zero     v3

#define rq00     v4
#define rq01     v5
#define rq02     v6
#define rq03     v7

#define rq1m0    v8
#define rq1m1    v9
#define rq1m2    v10
#define rq1m3    v11

#define bt1m0    v12
#define bt1m1    v13
#define bt1m2    v14
#define bt1m3    v15

#define bt10     v12
#define bt11     v13
#define bt12     v14
#define bt13     v15

#define bt00     v16
#define bt01     v17
#define bt02     v18
#define bt03     v19

#define bt20     v16
#define bt21     v17
#define bt22     v18
#define bt23     v19

#define sum00    v20
#define sum01    v21
#define sum02    v22
#define sum03    v23

#define sum10    v24
#define sum11    v25
#define sum12    v26
#define sum13    v27
/**
    Begin code text
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r14
USE_THRU_v27( VRSAVE_COND )
/**
    Load up local loop registers
**/
    ADD(bt0ptr, bt1mptr, hat_tc)
    VXOR(sum00, sum00, sum00)
    ADD(bt1ptr, bt0ptr, hat_tc)
    LI(index1, 16)
    ADD(bt2ptr, bt1ptr, hat_tc)
    VXOR(sum01, sum01, sum01)
    LI(index2, 32)
    VXOR(sum02, sum02, sum02)
    LI(index3, 48)
    VXOR(sum03, sum03, sum03)

    VXOR(sum10, sum10, sum10)
    VXOR(sum11, sum11, sum11)
    VXOR(sum12, sum12, sum12)
    VXOR(sum13, sum13, sum13)
    SRWI_C(icount, N, LOOP_COUNT_SHIFT)
    BEQ(do_half_block)
/**
    Loop entry code
**/
    LVX_BT( bt1m0, 0, bt1mptr )
    DECR_C(icount)
    LVX_BT( bt1m1, bt1mptr, index1 )
    LVX_BT( bt1m2, bt1mptr, index2 )
    LVX_BT( bt1m3, bt1mptr, index3 )
    LVX( rq10, 0, r1ptr )
    LVX( rq11, r1ptr, index1 )
    ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
    LVX( rq12, r1ptr, index2 )
    LVX( rq13, r1ptr, index3 )
    BR( mid_loop )

/**
    Loop computes three dot products held in 16 parts
**/
LABEL( loop )
/* { */
    LVX_BT( bt1m0, 0, bt1mptr )
    VMSUM( sum10, rq1m0, bt20, sum10 )
    LVX_BT( bt1m1, bt1mptr, index1 )
    VMSUM( sum11, rq1m1, bt21, sum11 )
    LVX_BT( bt1m2, bt1mptr, index2 )
    DECR_C(icount)
    VMSUM( sum12, rq1m2, bt22, sum12 )
    LVX_BT( bt1m3, bt1mptr, index3 )
    LVX( rq10, 0, r1ptr )
    VMSUM( sum13, rq1m3, bt23, sum13 )
    LVX( rq11, r1ptr, index1 )
    LVX( rq12, r1ptr, index2 )
    ADDI(bt2ptr, bt2ptr, LOOP_BLOCK_SIZE)
    LVX( rq13, r1ptr, index3 )
    ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)

LABEL( mid_loop )
    LVX_BT( bt00, 0, bt0ptr )
    VMSUM( sum00, rq10, bt1m0, sum00 )
    LVX_BT( bt01, bt0ptr, index1 )
    VMSUM( sum01, rq11, bt1m1, sum01 )
    LVX_BT( bt02, bt0ptr, index2 )
    VMSUM( sum02, rq12, bt1m2, sum02 )
    LVX_BT( bt03, bt0ptr, index3 )
    ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
    LVX( rq00, 0, r0ptr )
    VMSUM( sum03, rq13, bt1m3, sum03 )
    LVX( rq01, r0ptr, index1 )
    VMSUM( sum10, rq10, bt00, sum10 )
    LVX( rq02, r0ptr, index2 )
    VMSUM( sum11, rq11, bt01, sum11 )
    ADDI(bt0ptr, bt0ptr, LOOP_BLOCK_SIZE)
    VMSUM( sum12, rq12, bt02, sum12 )
    LVX( rq03, r0ptr, index3 )
    VMSUM( sum13, rq13, bt03, sum13 )
    LVX_BT( bt10, 0, bt1ptr )
    VMSUM( sum00, rq00, bt00, sum00 )
    LVX_BT( bt11, bt1ptr, index1 )
    ADDI(r0ptr, r0ptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt12, bt1ptr, index2 )
    VMSUM( sum01, rq01, bt01, sum01 )
    LVX_BT( bt13, bt1ptr, index3 )
    VMSUM( sum02, rq02, bt02, sum02 )

    LVX( rq1m0, 0, r1mptr )
    VMSUM( sum03, rq03, bt03, sum03 )
    ADDI(bt1ptr, bt1ptr, LOOP_BLOCK_SIZE)
    VMSUM( sum10, rq00, bt10, sum10 )
    LVX( rq1m1, r1mptr, index1 )
    VMSUM( sum11, rq01, bt11, sum11 )
    LVX( rq1m2, r1mptr, index2 )
    VMSUM( sum12, rq02, bt12, sum12 )
    LVX( rq1m3, r1mptr, index3 )
    ADDI(r1mptr, r1mptr, LOOP_BLOCK_SIZE)
    LVX_BT( bt20, 0, bt2ptr )
    VMSUM( sum13, rq03, bt13, sum13 )
    LVX_BT( bt21, bt2ptr, index1 )
    VMSUM( sum00, rq1m0, bt10, sum00 )
    LVX_BT( bt22, bt2ptr, index2 )
    VMSUM( sum01, rq1m1, bt11, sum01 )
    LVX_BT( bt23, bt2ptr, index3 )
    VMSUM( sum02, rq1m2, bt12, sum02 )
    VMSUM( sum03, rq1m3, bt13, sum03 )
/* } */
    BNE( loop )
/**
    Loop exit code
**/
    VMSUM( sum10, rq1m0, bt20, sum10 )
    VMSUM( sum11, rq1m1, bt21, sum11 )
    ADDI(bt2ptr, bt2ptr, LOOP_BLOCK_SIZE)
    VMSUM( sum12, rq1m2, bt22, sum12 )
    VMSUM( sum13, rq1m3, bt23, sum13 )

/**
    Remainders
**/
LABEL(do_half_block)
    ANDI_C( icount, N, HALF_BLOCK_BIT )
    BEQ(do_quarter_block)
    LVX_BT( bt1m0, 0, bt1mptr )
    LVX_BT( bt1m1, bt1mptr, index1 )
    ADDI(bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1))
    LVX( rq10, 0, r1ptr )
    LVX( rq11, r1ptr, index1 )
    ADDI(r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum00, rq10, bt1m0, sum00 )
    VMSUM( sum01, rq11, bt1m1, sum01 )

    LVX_BT( bt00, 0, bt0ptr )
    LVX_BT( bt01, bt0ptr, index1 )
    ADDI(bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum10, rq10, bt00, sum10 )
    VMSUM( sum11, rq11, bt01, sum11 )

    LVX( rq00, 0, r0ptr )
    LVX( rq01, r0ptr, index1 )
    ADDI(r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum00, rq00, bt00, sum00 )
    VMSUM( sum01, rq01, bt01, sum01 )

    LVX_BT( bt10, 0, bt1ptr )
    LVX_BT( bt11, bt1ptr, index1 )
    ADDI(bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum10, rq00, bt10, sum10 )
    VMSUM( sum11, rq01, bt11, sum11 )

    LVX( rq1m0, 0, r1mptr )
    LVX( rq1m1, r1mptr, index1 )
    ADDI(r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum00, rq1m0, bt10, sum00 )
    VMSUM( sum01, rq1m1, bt11, sum01 )

    LVX_BT( bt20, 0, bt2ptr )
    LVX_BT( bt21, bt2ptr, index1 )
    ADDI(bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1))
    VMSUM( sum10, rq1m0, bt20, sum10 )
    VMSUM( sum11, rq1m1, bt21, sum11 )

LABEL(do_quarter_block)
    ANDI_C( icount, N, QUARTER_BLOCK_BIT )
    BEQ(combine)
    LVX_BT( bt1m0, 0, bt1mptr )
    LVX( rq10, 0, r1ptr )
    VMSUM( sum00, rq10, bt1m0, sum00 )
    LVX_BT( bt00, 0, bt0ptr )
    VMSUM( sum10, rq10, bt00, sum10 )
    LVX( rq00, 0, r0ptr )
    VMSUM( sum00, rq00, bt00, sum00 )
    LVX_BT( bt10, 0, bt1ptr )
    VMSUM( sum10, rq00, bt10, sum10 )
    LVX( rq1m0, 0, r1mptr )
    VMSUM( sum00, rq1m0, bt10, sum00 )
    LVX_BT( bt20, 0, bt2ptr )
    VMSUM( sum10, rq1m0, bt20, sum10 )



/**
    Combine sums and return
**/
LABEL(combine)
    VXOR( zero, zero, zero )
    VADDSWS( sum00, sum00, sum01 )    /* s00 s01 s02 s03 */
    VADDSWS( sum10, sum10, sum11 )
    VADDSWS( sum02, sum02, sum03 )    /* s22 s21 s22 s23 */
    VADDSWS( sum12, sum12, sum13 )
    VADDSWS( sum00, sum00, sum02 )    /* s00 s01 s02 s03 */
    VADDSWS( sum10, sum10, sum12 )
    VSUMSWS( sum00, sum00, zero )     /* xxx xxx xxx s00 */
    VSUMSWS( sum10, sum10, zero )
    VSPLTW( sum00, sum00, 3 )         /* s00 s00 s00 s00 */
    STVEWX( sum00, 0, C )
    ADDI( C, C, 4 )
    VSPLTW( sum10, sum10, 3 )
    STVEWX( sum10, 0, C )
/**
    Return
**/
LABEL( ret )
FREE_THRU_v27( VRSAVE_COND )
REST_r13_r14
RETURN
FUNC_EPILOG
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--- MC Standard Algorithms -- PPC Macro language Version 



File Name:    dotpr9_8bit.mac

Description:  Source code for routine which computes nine
              dot products, combining the nine sums
              into three outputs prior to exit.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
0.0         000510    fpl         Created
0.1         000512    fpl         Added num_cached_rows
0.2         000521    fpl         Changed to fixed point
0.3         000605    fpl         Changed to .k file
0.4         000926    jg          Back to .mac and no dsts



#include "salppc.inc"

#define LVX_BT( vT, rA, rB )       LVX( vT, rA, rB )
#define FUNC_ENTRY                 dotpr9_8bit
#define VMSUM( vT, vA, vB, vC )    VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT           6
#define HALF_BLOCK_BIT             0x20
#define QUARTER_BLOCK_BIT          0x10
#define LOOP_BLOCK_SIZE            64
/**
    Input parameters
**/
#define bt1mptr  r3
#define r1ptr    r4
#define r0ptr    r5
#define r1mptr   r6
#define C        r7
#define N        r8
#define hat_tc   r9
/**
    Local loop registers
**/
#define bt0ptr   r10
#define bt1ptr   r11
#define bt2ptr   r12
#define bt3ptr   r13
#define index1   r14
#define index2   r15
#define index3   r0
#define icount   hat_tc
/**
    G4 registers
**/
#define rq10     v0
#define rq11     v1
#define rq12     v2
#define rq13     v3
#define zero     v3

#define bt30     v0
#define bt31     v1
#define bt32     v2
#define bt33     v3

#define rq00     v4
#define rq01     v5
#define rq02     v6
#define rq03     v7

#define rq1m0    v8
#define rq1m1    v9
#define rq1m2    v10
#define rq1m3    v11

#define bt1m0    v12
#define bt1m1    v13
#define bt1m2    v14
#define bt1m3    v15

#define bt10     v12
#define bt11     v13
#define bt12     v14
#define bt13     v15

#define bt00     v16
#define bt01     v17
#define bt02     v18
#define bt03     v19

#define bt20     v16
#define bt21     v17
#define bt22     v18
#define bt23     v19

#define sum00    v20
#define sum01    v21
#define sum02    v22
#define sum03    v23

#define sum10    v24
#define sum11    v25
#define sum12    v26
#define sum13    v27

#define sum20    v28
#define sum21    v29
#define sum22    v30
#define sum23    v31
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/** 

Begin code text 
**/ 

FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r15
USE_THRU_v31( VRSAVE_COND )

/**
Load up local loop registers
**/

ADD( bt0ptr, bt1mptr, hat_tc )
VXOR( sum00, sum00, sum00 )
ADD( bt1ptr, bt0ptr, hat_tc )
LI( index1, 16 )
ADD( bt2ptr, bt1ptr, hat_tc )
VXOR( sum01, sum01, sum01 )
ADD( bt3ptr, bt2ptr, hat_tc )
LI( index2, 32 )
VXOR( sum02, sum02, sum02 )
LI( index3, 48 )
VXOR( sum03, sum03, sum03 )
VXOR( sum10, sum10, sum10 )
VXOR( sum11, sum11, sum11 )
VXOR( sum12, sum12, sum12 )
VXOR( sum13, sum13, sum13 )
VXOR( sum20, sum20, sum20 )
VXOR( sum21, sum21, sum21 )
VXOR( sum22, sum22, sum22 )
VXOR( sum23, sum23, sum23 )
SRWI_C( icount, N, LOOP_COUNT_SHIFT )
BEQ( do_half_block )

/**
Loop entry code
**/

LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
DECR_C( icount )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
LVX_BT( bt00, 0, bt0ptr )
BR( mid_loop )

/**
Nine dot products producing 3 sums:
sum0 = (R1 * Bt1m) + (R0 * Bt0) + (R1m * Bt1)
sum1 = (R1 * Bt0) + (R0 * Bt1) + (R1m * Bt2)
sum2 = (R1 * Bt1) + (R0 * Bt2) + (R1m * Bt3)
**/

LABEL( loop )
/* { */
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */
LVX_BT( bt1m1, bt1mptr, index1 )
VMSUM( sum21, rq1m1, bt31, sum21 )
LVX_BT( bt1m2, bt1mptr, index2 )
VMSUM( sum22, rq1m2, bt32, sum22 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
VMSUM( sum23, rq1m3, bt33, sum23 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq11, r1ptr, index1 )
VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
LVX( rq12, r1ptr, index2 )
VMSUM( sum21, rq01, bt21, sum21 )
DECR_C( icount )
VMSUM( sum22, rq02, bt22, sum22 )
LVX( rq13, r1ptr, index3 )
VMSUM( sum23, rq03, bt23, sum23 )
LVX_BT( bt00, 0, bt0ptr )

/**
Loop entry
**/

LABEL( mid_loop )
VMSUM( sum00, rq10, bt1m0, sum00 ) /* R1 * Bt1m */
LVX_BT( bt01, bt0ptr, index1 )
ADDI( r1ptr, r1ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt02, bt0ptr, index2 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt03, bt0ptr, index3 )
VMSUM( sum02, rq12, bt1m2, sum02 )
LVX( rq00, 0, r0ptr )
VMSUM( sum03, rq13, bt1m3, sum03 )
ADDI( bt0ptr, bt0ptr, LOOP_BLOCK_SIZE )
VMSUM( sum10, rq10, bt00, sum10 ) /* R1 * Bt0 */
LVX( rq01, r0ptr, index1 )
VMSUM( sum11, rq11, bt01, sum11 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum12, rq12, bt02, sum12 )
LVX( rq03, r0ptr, index3 )
ADDI( r0ptr, r0ptr, LOOP_BLOCK_SIZE )
VMSUM( sum13, rq13, bt03, sum13 )
LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
VMSUM( sum00, rq00, bt00, sum00 ) /* R0 * Bt0 */
LVX_BT( bt12, bt1ptr, index2 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt13, bt1ptr, index3 )
VMSUM( sum02, rq02, bt02, sum02 )
VMSUM( sum03, rq03, bt03, sum03 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum20, rq10, bt10, sum20 ) /* R1 * Bt1 */
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum21, rq11, bt11, sum21 )
LVX( rq1m2, r1mptr, index2 )
ADDI( bt1ptr, bt1ptr, LOOP_BLOCK_SIZE )
LVX( rq1m3, r1mptr, index3 )
VMSUM( sum22, rq12, bt12, sum22 )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum23, rq13, bt13, sum23 )
LVX_BT( bt21, bt2ptr, index1 )
VMSUM( sum10, rq00, bt10, sum10 ) /* R0 * Bt1 */
ADDI( r1mptr, r1mptr, LOOP_BLOCK_SIZE )
VMSUM( sum11, rq01, bt11, sum11 )
LVX_BT( bt22, bt2ptr, index2 )
VMSUM( sum12, rq02, bt12, sum12 )
LVX_BT( bt23, bt2ptr, index3 )
VMSUM( sum13, rq03, bt13, sum13 )
LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )
VMSUM( sum00, rq1m0, bt10, sum00 ) /* R1m * Bt1 */
LVX_BT( bt32, bt3ptr, index2 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt33, bt3ptr, index3 )
VMSUM( sum02, rq1m2, bt12, sum02 )
VMSUM( sum03, rq1m3, bt13, sum03 )
ADDI( bt2ptr, bt2ptr, LOOP_BLOCK_SIZE )
VMSUM( sum10, rq1m0, bt20, sum10 ) /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )
ADDI( bt3ptr, bt3ptr, LOOP_BLOCK_SIZE )
VMSUM( sum12, rq1m2, bt22, sum12 )
VMSUM( sum13, rq1m3, bt23, sum13 )
/* } */
BNE( loop )

/**
Loop exit code
**/

VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )
VMSUM( sum22, rq1m2, bt32, sum22 )
VMSUM( sum23, rq1m3, bt33, sum23 )
VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )
VMSUM( sum22, rq02, bt22, sum22 )
VMSUM( sum23, rq03, bt23, sum23 )

/**
Remainders
**/

LABEL( do_half_block )
ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ( do_quarter_block )

LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI( bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI( r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq10, bt1m0, sum00 ) /* R1 * Bt1m */
VMSUM( sum01, rq11, bt1m1, sum01 )

LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI( bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq10, bt00, sum10 ) /* R1 * Bt0 */
VMSUM( sum11, rq11, bt01, sum11 )

LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
ADDI( r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq00, bt00, sum00 ) /* R0 * Bt0 */
VMSUM( sum01, rq01, bt01, sum01 )

LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum20, rq10, bt10, sum20 ) /* R1 * Bt1 */
VMSUM( sum21, rq11, bt11, sum21 )
VMSUM( sum10, rq00, bt10, sum10 ) /* R0 * Bt1 */
VMSUM( sum11, rq01, bt11, sum11 )

LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
ADDI( r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq1m0, bt10, sum00 ) /* R1m * Bt1 */
VMSUM( sum01, rq1m1, bt11, sum01 )

LVX_BT( bt20, 0, bt2ptr )
LVX_BT( bt21, bt2ptr, index1 )
ADDI( bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )
VMSUM( sum10, rq1m0, bt20, sum10 ) /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )

LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )
ADDI( bt3ptr, bt3ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )

/**
four more sums
**/

LABEL( do_quarter_block )
ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ( combine )

LVX_BT( bt1m0, 0, bt1mptr )
LVX( rq10, 0, r1ptr )
VMSUM( sum00, rq10, bt1m0, sum00 ) /* R1 * Bt1m */
ADDI( bt1mptr, bt1mptr, 16 )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum10, rq10, bt00, sum10 ) /* R1 * Bt0 */
LVX( rq00, 0, r0ptr )
VMSUM( sum00, rq00, bt00, sum00 ) /* R0 * Bt0 */
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum20, rq10, bt10, sum20 ) /* R1 * Bt1 */
VMSUM( sum10, rq00, bt10, sum10 ) /* R0 * Bt1 */
LVX( rq1m0, 0, r1mptr )
VMSUM( sum00, rq1m0, bt10, sum00 ) /* R1m * Bt1 */
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum20, rq00, bt20, sum20 ) /* R0 * Bt2 */
VMSUM( sum10, rq1m0, bt20, sum10 ) /* R1m * Bt2 */
LVX_BT( bt30, 0, bt3ptr )
VMSUM( sum20, rq1m0, bt30, sum20 ) /* R1m * Bt3 */

/**
Combine sums and return
**/

LABEL( combine )
VXOR( zero, zero, zero )
VADDSWS( sum00, sum00, sum01 )
VADDSWS( sum10, sum10, sum11 )
VADDSWS( sum20, sum20, sum21 )

VADDSWS( sum02, sum02, sum03 )
VADDSWS( sum12, sum12, sum13 )
VADDSWS( sum22, sum22, sum23 )

VADDSWS( sum00, sum00, sum02 )
VADDSWS( sum10, sum10, sum12 )
VADDSWS( sum20, sum20, sum22 )

VSUMSWS( sum00, sum00, zero ) /* xxx xxx xxx s00 */
VSUMSWS( sum10, sum10, zero )
VSUMSWS( sum20, sum20, zero )

VSPLTW( sum00, sum00, 3 ) /* s00 s00 s00 s00 */
STVEWX( sum00, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum10, sum10, 3 )
STVEWX( sum10, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum20, sum20, 3 )
STVEWX( sum20, 0, C )

/**
Return
**/

LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
REST_r13_r15
RETURN
FUNC_EPILOG
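
The nine dot products that the routine above accumulates can be modelled in scalar C. The following is a minimal sketch, not the vectorized implementation: the function names `dot8` and `dotpr9_ref` are hypothetical, both operands are assumed to be signed bytes (the vector multiply-sum instruction may treat its operands differently), and the saturating adds of the combine step are replaced by plain 32-bit addition. As in the listing, the five Bt rows are assumed to start `hat_tc` bytes apart.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar dot product of two byte vectors, accumulated in 32 bits. */
static int32_t dot8(const int8_t *r, const int8_t *bt, size_t n)
{
    int32_t acc = 0;
    for (size_t k = 0; k < n; k++)
        acc += (int32_t)r[k] * (int32_t)bt[k];
    return acc;
}

/* Hypothetical reference model of the nine-dot-product routine.
 * bt1m points at row Bt1m; rows Bt0..Bt3 follow at hat_tc-byte steps
 * (assumed from the ADD(...) pointer setup in the listing). */
void dotpr9_ref(const int8_t *bt1m, const int8_t *r1, const int8_t *r0,
                const int8_t *r1m, int32_t c[3], size_t n, size_t hat_tc)
{
    const int8_t *bt0 = bt1m + hat_tc;
    const int8_t *bt1 = bt0 + hat_tc;
    const int8_t *bt2 = bt1 + hat_tc;
    const int8_t *bt3 = bt2 + hat_tc;

    /* sum0 = R1*Bt1m + R0*Bt0 + R1m*Bt1 */
    c[0] = dot8(r1, bt1m, n) + dot8(r0, bt0, n) + dot8(r1m, bt1, n);
    /* sum1 = R1*Bt0 + R0*Bt1 + R1m*Bt2 */
    c[1] = dot8(r1, bt0, n) + dot8(r0, bt1, n) + dot8(r1m, bt2, n);
    /* sum2 = R1*Bt1 + R0*Bt2 + R1m*Bt3 */
    c[2] = dot8(r1, bt1, n) + dot8(r0, bt2, n) + dot8(r1m, bt3, n);
}
```

The assembly version processes 64 bytes per loop pass and handles 32-byte and 16-byte remainders separately, so it appears to expect `n` to be a multiple of 16; the scalar sketch has no such restriction.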
#ifndef MCOS_55
#define MCOS_55 0
#endif

MC Standard Algorithms 603e Macro language Version

File Name: CDOTPR.MAC

Description: Vector Single Precision Complex Dot Product
Entry/params: CDOTPR (A, I, B, J, C, N)
Formula: C[0] = sum (A[mI]*B[mJ] - A[mI+1]*B[mJ+1])
         C[1] = sum (A[mI]*B[mJ+1] + A[mI+1]*B[mJ])
         for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 1995 All rights reserved

Revision   Date     Engineer   Reason
0.0        960502   fpl        Created
0.1        960618   fpl        Added Esal entry
0.2        970128   fpl        Added dcbt logic
0.3        970203   fpl        Corrected ABIT define
0.4        970522   jfk        Added new dcbx test macros
0.5        980325   fpl        Added 740 code segment
0.6        980404   fpl        Removed loop stall
0.7        980708   fpl        Added build macros
0.8        980820   jfk        Added new DCBT macro
0.9        981019   fpl        Added z function
0.10       981025   fpl        Modified z entry
0.11       990310   fpl        750/G4 integration
0.12       990730   fpl        Added conjugate entry
1.0        000223   fpl        Increased minimum VMX count
1.1        000305   jfk        removed branches to entrypoints
1.2        000607   jfk        Fixed floating point save bug
1.3        000610   fpl        Added new API macro



#include "salppc.inc"

#define BR_IF_VMX_Z2( root_name, uroot_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
cmplwi n, min_n_imm; \
blt z_skip_vmx; \
cmpwi s1, unit_s_imm; \
bne z_skip_vmx; \
cmpwi s2, unit_s_imm; \
xor r0, pr1, pi1; \
bne z_skip_vmx; \
andi. r0, r0, 0xf; \
xor r0, pr2, pi2; \
bne z_skip_vmx; \
andi. r0, r0, 0xf; \
xor r0, pr1, pr2; \
bne z_skip_vmx; \
andi. r0, r0, 0xf; \
bne z_unaligned_vmx; \
BR_VMX_Z2( root_name, eflag, s1 ) \
z_unaligned_vmx: \
BR_VMX_Z2( uroot_name, eflag, s1 ) \
z_skip_vmx:
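
The branch filter above decides whether the vector (VMX) code path may be used and, if so, whether the aligned or the unaligned variant applies. The decision logic can be sketched in C as follows. This is an illustrative model only: the enum, the function name `vmx_filter`, and the parameter types are hypothetical, and the sketch returns a code where the macro branches to a routine.

```c
#include <stdint.h>

enum vmx_path { VMX_SKIP, VMX_ALIGNED, VMX_UNALIGNED };

/* Hypothetical C model of the VMX entry filter.
 * pr/pi: addresses of the real and imaginary parts of each operand.
 * s1/s2: element strides; n: element count. */
enum vmx_path vmx_filter(uintptr_t pr1, uintptr_t pi1, long s1,
                         uintptr_t pr2, uintptr_t pi2, long s2,
                         unsigned long n, unsigned long min_n, long unit_s)
{
    if (n < min_n)
        return VMX_SKIP;              /* too few points to be worthwhile */
    if (s1 != unit_s || s2 != unit_s)
        return VMX_SKIP;              /* both strides must be unit stride */
    if ((pr1 ^ pi1) & 0xf)
        return VMX_SKIP;              /* real/imag of operand 1 differ mod 16 */
    if ((pr2 ^ pi2) & 0xf)
        return VMX_SKIP;              /* real/imag of operand 2 differ mod 16 */
    /* The two operands either share a 16-byte phase or they do not. */
    return ((pr1 ^ pr2) & 0xf) ? VMX_UNALIGNED : VMX_ALIGNED;
}
```

XOR-ing two addresses and masking with `0xf` tests whether they have the same offset within a 16-byte vector, which is cheaper than testing each address separately when only relative alignment matters.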

#define ACOND 5
#define ABIT 2
#define BCOND 6
#define BBIT 1



/**
API registers
**/

#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
z input args
**/

#define Ar A
#define Ai r10
#define Br B
#define Bi r11
#define Cr C
#define Ci r12

/**
Local registers
**/

#define count r13
#define rtmp r13
#define nextline r14

/**
Fpu registers
**/

#define rsumr0 f0
#define rsumi0 f1
#define isumr0 f2
#define isumi0 f3

#define ar0 f4
#define ai0 f5
#define ar1 f6
#define ai1 f7
#define ar2 f8
#define ai2 f9
#define ar3 f10
#define ai3 f11

#define br0 f12
#define bi0 f13
#define br1 f14
#define bi1 f15
#define br2 f16
#define bi2 f17
#define br3 f18
#define bi3 f19

#if defined( BUILD_MAX )
#if MCOS_55
DECLARE_VMX_Z2( _zdotpr_vmx_cc )
#else
DECLARE_VMX_Z2( _zdotpr_vmx )
#endif
DECLARE_VMX_Z2( _zdotpr4_vmx )
#endif



/**
Code text: Conjugate
**/

FUNC_PROLOG
#ifndef COMPILE_C
U_ENTRY( fixed cidotpr_ ) /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed cidotpr ) /* C SAL */
LI( EFLAG, SAL_NNN ) /* NNN EFLAG (default) */
BR( cidotprx_common ) /* common path */
U_ENTRY( fixed cidotprx_ ) /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
U_ENTRY( fixed cidotprx ) /* C ESAL */
LABEL( cidotprx_common ) /* common path */
ADDI( Ai, Ar, 4 )
MR( Bi, Br )
ADDI( Br, Br, 4 )
MR( Ci, Cr )
ADDI( Cr, Cr, 4 )
BR( common ) /* common path */

/**
Normal
**/

FUNC_PROLOG
#ifndef COMPILE_C
U_ENTRY( fixed cdotpr_ ) /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed cdotpr ) /* C SAL */
LI( EFLAG, SAL_NNN ) /* NNN EFLAG (default) */
BR( cdotprx_common ) /* common path */
U_ENTRY( fixed cdotprx_ ) /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
U_ENTRY( fixed cdotprx ) /* C ESAL */
LABEL( cdotprx_common ) /* common path */
ADDI( Ai, Ar, 4 )
ADDI( Bi, Br, 4 )
ADDI( Ci, Cr, 4 )
BR( common ) /* common path */

/**
Split complex entries
Conjugate
**/

U_ENTRY( fixed zidotpr_ ) /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed zidotpr ) /* C SAL */
LI( EFLAG, SAL_NNN ) /* NNN EFLAG (default) */
BR( zidotprx_common )
U_ENTRY( fixed zidotprx_ ) /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
#endif

ENTRY_7( fixed zidotprx, A, I, B, J, C, N, EFLAG )
LABEL( zidotprx_common )

/**
Assign split complex pointers, do the conjugate trick
**/

LWZ( Ai, A, 4 )
LWZ( Ar, A, 0 )
LWZ( Bi, B, 0 )
LWZ( Br, B, 4 )
LWZ( Ci, C, 0 )
LWZ( Cr, C, 4 )
BR( z_common )

/**
Normal
**/

U_ENTRY( fixed zdotpr_ ) /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed zdotpr ) /* C SAL */
LI( EFLAG, SAL_NNN ) /* NNN EFLAG (default) */
BR( zdotprx_common )
U_ENTRY( fixed zdotprx_ ) /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
#endif

/**
C ESAL
**/

ENTRY_7( fixed zdotprx, A, I, B, J, C, N, EFLAG )
DECLARE_r10_r14
DECLARE_f0_f19
LABEL( zdotprx_common )

/**
Assign split complex pointers
**/

LWZ( Ai, A, 4 ) /* must load imag first since Ar reg = A reg */
LWZ( Ar, A, 0 )
LWZ( Bi, B, 4 )
LWZ( Br, B, 0 )
LWZ( Ci, C, 4 )
LWZ( Cr, C, 0 )

/**
VMX API filter
Test if okay to enter VMX code and branch to VMX code
VMX loop - process all N points
**/

LABEL( z_common )

#if defined( BUILD_MAX )

#define MIN_VMX_N 20
#define UNIT_STRIDE 1

#if MCOS_55
BR_IF_VMX_Z2( _zdotpr_vmx_cc, _zdotpr4_vmx, MIN_VMX_N, UNIT_STRIDE, \
              Ar, Ai, I, Br, Bi, J, N, EFLAG )
#else
BR_IF_VMX_Z2( _zdotpr_vmx, _zdotpr4_vmx, MIN_VMX_N, UNIT_STRIDE, \
              Ar, Ai, I, Br, Bi, J, N, EFLAG )
#endif

#endif /* BUILD_MAX */



/**
Point of common path where all entries join
Test for small counts
**/

LABEL( common )
SAVE_r13_r14
SAVE_f14_f19
CMPLWI( N, 0 )
BEQ( ret )
CMPLWI( N, 1 )
BEQ( do1 )
CMPLWI( N, 2 )
BEQ( do2 )
CMPLWI( N, 3 )
BEQ( do3 )

/**
check for uncached (and local) vectors
**/

SET_2_DCBT_COND( ACOND, ABIT, BCOND, BBIT, EFLAG, rtmp )
LI( nextline, 32 )

/**
740 code segment, start up loop code
**/

#if defined( BUILD_750 ) || defined( BUILD_MAX )
LFS( ar0, Ar, 0 )
SRWI( count, N, 2 ) /* count = N >> 2 */
LFS( br0, Br, 0 )
SLWI( I, I, 2 ) /* byte strides */
LFS( ai0, Ai, 0 )
SLWI( J, J, 2 )
LFS( bi0, Bi, 0 )

LFSUX( ar1, Ar, I )
LFSUX( br1, Br, J )
LFSUX( ai1, Ai, I )
LFSUX( bi1, Bi, J )
LFSUX( ar2, Ar, I )
LFSUX( br2, Br, J )
LFSUX( ai2, Ai, I )
LFSUX( bi2, Bi, J )

FMULS( rsumr0, ar0, br0 )
LFSUX( ar3, Ar, I )
LFSUX( br3, Br, J )
FMULS( rsumi0, ai0, bi0 )
LFSUX( ai3, Ai, I )
LFSUX( bi3, Bi, J )
FMULS( isumi0, ar0, bi0 )
DECR_C( count )
FMULS( isumr0, ai0, br0 )
BEQ( flush_loop_740 )
BR( mloop_740 )

/**
Top of 740 loop
**/

LABEL( loop_740 )
LFSUX( ar3, Ar, I )
FMADDS( rsumr0, ar0, br0, rsumr0 )
LFSUX( br3, Br, J )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
LFSUX( ai3, Ai, I )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )
LFSUX( bi3, Bi, J )

LABEL( mloop_740 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
LFSUX( ar0, Ar, I )
DCBT_IF( ACOND, Ar, nextline )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
LFSUX( br0, Br, J )
DECR_C( count )
FMADDS( isumi0, ar1, bi1, isumi0 )
LFSUX( ai0, Ai, I )
FMADDS( isumr0, ai1, br1, isumr0 )
LFSUX( bi0, Bi, J )
DCBT_IF( BCOND, Br, nextline )
FMADDS( rsumr0, ar2, br2, rsumr0 )
LFSUX( ar1, Ar, I )
LFSUX( br1, Br, J )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
LFSUX( ai1, Ai, I )
FMADDS( isumi0, ar2, bi2, isumi0 )
LFSUX( bi1, Bi, J )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
LFSUX( ar2, Ar, I )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
LFSUX( br2, Br, J )
FMADDS( isumi0, ar3, bi3, isumi0 )
LFSUX( ai2, Ai, I )
LFSUX( bi2, Bi, J )
FMADDS( isumr0, ai3, br3, isumr0 )
BNE( loop_740 )

/**
Finish last pass
**/

FMADDS( rsumr0, ar0, br0, rsumr0 )
LFSUX( ar3, Ar, I )
LFSUX( br3, Br, J )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
LFSUX( ai3, Ai, I )
LFSUX( bi3, Bi, J )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )

LABEL( flush_loop_740 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
BR( remain )

#endif /** 750 specific code section **/

/**
set up for loop entry, here if N >= 2
**/

#if defined( BUILD_603 )
LABEL( start_603 )
LFS( ar0, Ar, 0 )
SLWI( I, I, 2 ) /* byte strides */
LFS( ai0, Ai, 0 )
SRWI( count, N, 2 ) /* count = N >> 2 */
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )
LFSUX( ar3, Ar, I )
LFSUX( ai3, Ai, I )
DCBT_IF( ACOND, Ar, nextline )

LFS( br0, Br, 0 )
DECR_C( count )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )
LFSUX( br3, Br, J )
LFSUX( bi3, Bi, J )
DCBT_IF( BCOND, Br, nextline )

FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
BEQ( remain )

/**
main loop maintains four partial sums
representing two complex sum updates per pass
**/

LABEL( loop )
LFSUX( ar0, Ar, I )
LFSUX( ai0, Ai, I )
LFSUX( ar1, Ar, I )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )
LFSUX( ar3, Ar, I )
LFSUX( ai3, Ai, I )
DCBT_IF( ACOND, Ar, nextline )
DECR_C( count )
LFSUX( br0, Br, J )
LFSUX( bi0, Bi, J )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )
LFSUX( br3, Br, J )
LFSUX( bi3, Bi, J )
DCBT_IF( BCOND, Br, nextline )
FMADDS( rsumr0, ar0, br0, rsumr0 )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
BNE( loop )
#endif /** 603 specific code section **/

/**
remainder loop
**/

LABEL( remain )
ANDI_C( count, N, 2 ) /* bit 1 */
BEQ( sum1 )
LFSUX( ar0, Ar, I )
LFSUX( ai0, Ai, I )
LFSUX( ar1, Ar, I )
LFSUX( ai1, Ai, I )
LFSUX( br0, Br, J )
LFSUX( bi0, Bi, J )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
FMADDS( rsumr0, ar0, br0, rsumr0 )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )

LABEL( sum1 )
ANDI_C( count, N, 1 ) /* bit 0 */
BEQ( combine ) /* if no sums left */
LFSUX( ar0, Ar, I )
LFSUX( br0, Br, J )
LFSUX( ai0, Ai, I )
LFSUX( bi0, Bi, J )
FMADDS( rsumr0, ar0, br0, rsumr0 )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )

/**
Combine partial sums, write out results and return
**/

LABEL( combine )
FSUBS( rsumr0, rsumr0, rsumi0 )
STFS( rsumr0, Cr, 0 )
FADDS( isumi0, isumi0, isumr0 )
STFS( isumi0, Ci, 0 )
BR( ret )

/**
here for N = 1,2,3
**/

LABEL( do3 )
LFS( ar0, Ar, 0 )
SLWI( I, I, 2 ) /* byte strides */
LFS( ai0, Ai, 0 )
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )
LFS( br0, Br, 0 )
DECR_C( count )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )
FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
BR( combine )

LABEL( do2 )
LFS( ar0, Ar, 0 )
SLWI( I, I, 2 ) /* byte strides */
LFS( ai0, Ai, 0 )
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )
LFS( br0, Br, 0 )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
BR( combine )

LABEL( do1 )
LFS( ai0, Ai, 0 )
LFS( bi0, Bi, 0 )
LFS( br0, Br, 0 )
LFS( ar0, Ar, 0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumr0, ai0, br0 )
FMSUBS( rsumr0, ar0, br0, rsumi0 )
STFS( rsumr0, Cr, 0 )
FMADDS( isumi0, ar0, bi0, isumr0 )
STFS( isumi0, Ci, 0 )

/**
return
**/

LABEL( ret )
REST_f14_f19
REST_r13_r14
RETURN
FUNC_EPILOG
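
The four partial sums kept by the complex dot product routine above, and the way they are combined at the end, can be written compactly in scalar C. This is a minimal sketch for interleaved complex data, assuming strides `I` and `J` are given in units of `float` (the listing shifts them left by 2 to obtain byte strides); the function name `cdotpr_ref` is hypothetical, and the sketch does not reproduce the rounding of the fused multiply-add instructions.

```c
#include <stddef.h>

/* Hypothetical scalar reference for the complex dot product:
 *   C[0] = sum( A[mI]*B[mJ]   - A[mI+1]*B[mJ+1] )
 *   C[1] = sum( A[mI]*B[mJ+1] + A[mI+1]*B[mJ]   )   for m = 0 .. N-1
 * a, b: interleaved (real, imag) floats; i, j: strides in floats. */
void cdotpr_ref(const float *a, ptrdiff_t i, const float *b, ptrdiff_t j,
                float c[2], size_t n)
{
    float rsumr = 0.0f, rsumi = 0.0f;   /* real*real and imag*imag sums */
    float isumi = 0.0f, isumr = 0.0f;   /* real*imag and imag*real sums */

    if (n == 0)
        return;                          /* the listing returns without storing */

    for (size_t m = 0; m < n; m++) {
        const float ar = a[(ptrdiff_t)m * i], ai = a[(ptrdiff_t)m * i + 1];
        const float br = b[(ptrdiff_t)m * j], bi = b[(ptrdiff_t)m * j + 1];
        rsumr += ar * br;
        rsumi += ai * bi;
        isumi += ar * bi;
        isumr += ai * br;
    }
    c[0] = rsumr - rsumi;                /* real part */
    c[1] = isumi + isumr;                /* imaginary part */
}
```

Keeping four independent accumulators, rather than two, lets the multiply-adds for the real and imaginary parts proceed without waiting on one another; the subtraction and addition are deferred to a single combine step.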
MC Standard Algorithms -- PPC Macro language Version

File Name: GEN_R_SUMS.MAC

Description: Multiple small dot product routine for wireless
group application.

Entry/params: GEN_R_SUMS (X_bf, Corr_bf, Ptov_map, R_sums, Num_phys_users)

Formula:

num_sums = 0;
for ( i = 0; i < Num_phys_users; i++ ) {
    for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
        sum = 0;
        for ( k = 0; k < 16; k++ ) {
            sum += (BF32)X_bf[k].real * (BF32)Corr_bf->real;
            sum += (BF32)X_bf[k].imag * (BF32)Corr_bf->imag;
            ++Corr_bf;
        }
        *R_sums++ = sum;
        ++num_sums;
    }
    X_bf += N_FINGERS_MAX_SQUARED;
}

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000906   fpl        Created



#include "salppc.inc"

#define DO_IO 1
#define DO_PREFETCH 1

#if DO_IO
#define PTOV_BUMP_1 1
#define CORR_BUMP_32 32
#define CORR_BUMP_64 64
#define X_BUMP_64 64
#define RSUM_BUMP_8 8
#define RSUM_BUMP_4 4
#else
#define PTOV_BUMP_1 0
#define CORR_BUMP_32 0
#define CORR_BUMP_64 0
#define X_BUMP_64 0
#define RSUM_BUMP_8 0
#define RSUM_BUMP_4 0
#endif

#define LOAD_CORR( vT, rA, rB ) LVX( vT, rA, rB )
#define DST_BUMP CORR_BUMP_64

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM ) \
DST( rA, rB, STRM ) \
ADDI( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM )
#endif

/**
Input parameters
**/

#define X_bf r3
#define Corr_bf r4
#define Ptov_map r5
#define R_sump r6
#define Num_phys_users r7

/**
Local GPRs
**/

#define icount r8
#define ptov_count r9
#define indx1 r10
#define indx2 r11
#define indx3 r12
#define sindex1 r13
#define dstp r14
#define dst_code r15

/**
G4 registers
**/

#define corr00 v0
#define corr01 v1
#define corr10 v2
#define corr11 v3

#define C0_0 v4
#define C1_0 v5
#define C0_8 v6
#define C1_8 v7

#define C0_16 v8
#define C1_16 v9
#define C0_24 v10
#define C1_24 v11

#define X0 v12
#define X8 v13
#define X16 v14
#define X24 v15

#define sum0 v16
#define sum1 v17
#define zero v18

/**
Begin code text
**/

ENTRY_5( gen_R_sums, X_bf, Corr_bf, Ptov_map, R_sump, Num_phys_users )
CMPWI( Num_phys_users, 0 )
BGT( start )
RETURN

LABEL( start )
SAVE_r13_r15
USE_THRU_v18( VRSAVE_COND )

/**
DST setup
**/

MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 1, 0 )
ADDI( dstp, Corr_bf, 80 ) /* start prefetch advanced */
PREFETCH( dstp, dst_code, 0 )

/**
Setup for outer loop entry
Read and expand two corr vectors
Set outer loop counter condition
**/

LI( indx1, 16 )
LI( indx2, 32 )
LI( indx3, 48 )
LI( sindex1, 4 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
LVX( corr00, 0, Corr_bf )
VXOR( zero, zero, zero )
LVX( corr01, Corr_bf, indx1 )
LVX( corr10, Corr_bf, indx2 )
LVX( corr11, Corr_bf, indx3 )
VUPKHSB( C0_0, corr00 )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
VUPKLSB( C0_8, corr00 )
ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump, R_sump, -RSUM_BUMP_8 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C0_16, corr01 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )

/**
Outer loop for each physical user
**/

LABEL( oloop )
/* { */
DECR( Num_phys_users )
LBZU( ptov_count, Ptov_map, 1 )
BEQ_CR( OLOOP_BIT, ret )
LVX( X0, 0, X_bf )
LVX( X8, X_bf, indx1 )
SRWI_C( icount, ptov_count, 1 )
LVX( X16, X_bf, indx2 )
LVX( X24, X_bf, indx3 )
ADDI( X_bf, X_bf, X_BUMP_64 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
BEQ_MINUS( one_sum )

/**
Top of sum loop
Produces two sums each pass
**/

LABEL( iloop )
/* { */
PREFETCH( dstp, dst_code, 0 )
VMSUMSHS( sum0, C0_0, X0, zero )
VMSUMSHS( sum1, C1_0, X0, zero )
LVX( corr00, 0, Corr_bf )
LVX( corr01, Corr_bf, indx1 )
LVX( corr10, Corr_bf, indx2 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
DECR_C( icount )
VMSUMSHS( sum1, C1_8, X8, sum1 )
LVX( corr11, Corr_bf, indx3 )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum1, C1_16, X16, sum1 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump, R_sump, RSUM_BUMP_8 )
VUPKLSB( C1_8, corr10 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VUPKHSB( C0_16, corr01 )
VMSUMSHS( sum1, C1_24, X24, sum1 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VSUMSWS( sum0, sum0, zero )
VUPKLSB( C1_24, corr11 )
VSUMSWS( sum1, sum1, zero )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump )
VSPLTW( sum1, sum1, 3 )
STVEWX( sum1, R_sump, sindex1 )
/* } */
BNE( iloop )

/**
Drop out, check for remainders
**/

ANDI_C( icount, ptov_count, 0x1 )
BEQ( oloop )

/**
Enters and exits with two corr vectors loaded and expanded to 16 bit
**/

LABEL( one_sum )
VMSUMSHS( sum0, C0_0, X0, zero )
VMSUMSHS( sum0, C0_8, X8, sum0 )
ADDI( R_sump, R_sump, RSUM_BUMP_8 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VSUMSWS( sum0, sum0, zero )
VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump )
ADDI( R_sump, R_sump, -RSUM_BUMP_4 ) /* pre-dec pointer for loop reentry */

/**
Setup for loop re-entry: corr00 consumed in one_sum section

loop exit ptr v
corr00 corr10 corr00 corr10

corr00 corr10 corr00 corr10
loop re-entry ptr
*/

VMR( corr00, corr10 )
LVX( corr10, 0, Corr_bf )
VMR( corr01, corr11 )
LVX( corr11, Corr_bf, indx1 )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_32 )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
VUPKHSB( C1_0, corr10 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C0_16, corr01 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )
/* } */
BR( oloop )

/**
Exit routine
**/

LABEL( ret )
FREE_THRU_v18( VRSAVE_COND )
REST_r13_r15
RETURN
FUNC_EPILOG
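
The formula given in the header of this routine can be turned into a small runnable C reference. The element types below are assumptions inferred from the vector code (X_bf is read as 16-bit halfwords, Corr_bf as bytes that are unpacked to 16 bits, and Ptov_map is read one byte at a time); the struct names `cx16` and `cx8`, the function name `gen_r_sums_ref`, and the value 16 for `N_FINGERS_MAX_SQUARED` (64 bytes of X_bf per user) are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define N_FINGERS_MAX_SQUARED 16              /* assumed: 64 bytes of X_bf per user */

typedef struct { int16_t real, imag; } cx16;  /* assumed X_bf element type */
typedef struct { int8_t  real, imag; } cx8;   /* assumed Corr_bf element type */

/* Hypothetical scalar reference for the multiple small dot products.
 * Returns the number of sums written to r_sums. */
size_t gen_r_sums_ref(const cx16 *x_bf, const cx8 *corr_bf,
                      const uint8_t *ptov_map, int32_t *r_sums,
                      int num_phys_users)
{
    size_t num_sums = 0;

    for (int i = 0; i < num_phys_users; i++) {
        /* One 16-point dot product per virtual user of physical user i. */
        for (int j = 0; j < (int)ptov_map[i]; j++) {
            int32_t sum = 0;
            for (int k = 0; k < 16; k++) {
                sum += (int32_t)x_bf[k].real * (int32_t)corr_bf->real;
                sum += (int32_t)x_bf[k].imag * (int32_t)corr_bf->imag;
                ++corr_bf;
            }
            *r_sums++ = sum;
            ++num_sums;
        }
        x_bf += N_FINGERS_MAX_SQUARED;        /* advance to the next user's block */
    }
    return num_sums;
}
```

The vector routine computes two such sums per inner-loop pass and falls back to a single-sum path when a user's count is odd; the scalar sketch ignores that scheduling, and it accumulates without the saturation that the vector multiply-sum instructions apply.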
MC Standard Algorithms -- PPC Macro language Version

File Name: GEN_R_SUMS2.MAC

Description: Multiple small dot product routine for wireless
group application.

Entry/params: GEN_R_SUMS2 (X_bf, Corr0_bf, Corr1_bf,
Ptov_map, R_sums0, R_sums1, Num_phys_users)

Formula:

num_sums = 0;
for ( i = 0; i < Num_phys_users; i++ ) {
    for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
        sum0 = 0;
        sum1 = 0;
        for ( k = 0; k < 16; k++ ) {
            sum0 += (BF32)X_bf[k].real * (BF32)Corr0_bf->real;
            sum0 += (BF32)X_bf[k].imag * (BF32)Corr0_bf->imag;
            sum1 += (BF32)X_bf[k].real * (BF32)Corr1_bf->real;
            sum1 += (BF32)X_bf[k].imag * (BF32)Corr1_bf->imag;
            ++Corr0_bf;
            ++Corr1_bf;
        }
        *R_sums0++ = sum0;
        *R_sums1++ = sum1;
        ++num_sums;
    }
    X_bf += N_FINGERS_MAX_SQUARED;
}

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000906   fpl        Created
0.1        000908   fpl        Fixed zero bug


#include "salppc.inc"

#define DO_IO 1
#define DO_PREFETCH 1

#if DO_IO
#define PTOV_BUMP_1 1
#define CORR_BUMP_32 32
#define CORR_BUMP_64 64
#define X_BUMP_64 64
#define RSUM_BUMP_8 8
#define RSUM_BUMP_4 4
#else
#define PTOV_BUMP_1 0
#define CORR_BUMP_32 0
#define CORR_BUMP_64 0
#define X_BUMP_64 0
#define RSUM_BUMP_8 0
#define RSUM_BUMP_4 0
#endif

#define LOAD_CORR( vT, rA, rB ) LVX( vT, rA, rB )
#define DST_BUMP CORR_BUMP_64

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM ) \
DST( rA, rB, STRM ) \
ADDI( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM )
#endif

#define OLOOP_BIT 6

/**
Input parameters
**/

#define X_bf r3
#define Corr0_bf r4
#define Corr1_bf r5
#define Ptov_map r6
#define R_sump0 r7
#define R_sump1 r8
#define Num_phys_users r9

/**
Local GPRs
**/

#define icount r10
#define ptov_count r11
#define indx1 r12
#define indx2 r13
#define indx3 r14
#define sindex1 r15
#define dstp r16
#define dst_code r17
#define dst_stride indx3

/**
G4 registers
**/

#define corr00 v0
#define corr01 v1
#define corr10 v2
#define corr11 v3
#define corr20 v4
#define corr21 v5
#define corr30 v6
#define corr31 corr00
#define zero v7

#define C0_0 v8
#define C1_0 v9
#define C2_0 v10
#define C3_0 v11

#define C0_8 v12
#define C1_8 v13
#define C2_8 v14
#define C3_8 v15

#define C0_16 v16
#define C1_16 v17
#define C2_16 v18
#define C3_16 v19

#define C0_24 v20
#define C1_24 v21
#define C2_24 v22
#define C3_24 v23

#define X0 v24
#define X8 v25
#define X16 v26
#define X24 v27

#define sum0 v28
#define sum1 v29
#define sum2 v30
#define sum3 v31
/**
   Begin code text
**/

FUNC_PROLOG

#if 1 /****** alignment may be important ******/

#endif

ENTRY_7( gen_R_sums2, X_bf, Corr0_bf, Corr1_bf, Ptov_map, R_sump0, R_sump1,
         Num_phys_users )

CMPWI( Num_phys_users, 0 )
BGT( start )
RETURN

LABEL( start )
SAVE_r13_r17
USE_THRU_v31( VRSAVE_COND )

/**
   DST setup
**/

SUB( dst_stride, Corr1_bf, Corr0_bf )
MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 2, dst_stride )
ADDI( dstp, Corr0_bf, 80 ) /* start prefetch advanced */

/* 48: 1087, 64: 1094, 80: 1043, 96: 1058, 112: 1049, 128: 1061 */
PREFETCH( dstp, dst_code, 0 )

/**
   Setup for outer loop entry
   Read and expand two corr vectors
   Set outer loop counter condition
**/

LI( indx1, 16 )
LI( indx2, 32 )
LI( indx3, 48 )
LI( sindex1, 4 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
LOAD_CORR( corr00, 0, Corr0_bf )
LOAD_CORR( corr10, Corr0_bf, indx2 )
ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
LOAD_CORR( corr20, 0, Corr1_bf )
ADDI( R_sump0, R_sump0, -RSUM_BUMP_8 )
LOAD_CORR( corr30, Corr1_bf, indx2 )
LOAD_CORR( corr01, Corr0_bf, indx1 )
ADDI( R_sump1, R_sump1, -RSUM_BUMP_8 )
LOAD_CORR( corr11, Corr0_bf, indx3 )
VXOR( zero, zero, zero )
LOAD_CORR( corr21, Corr1_bf, indx1 )

VUPKHSB( C0_0, corr00 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
VUPKHSB( C1_0, corr10 )
VUPKHSB( C2_0, corr20 )
VUPKHSB( C3_0, corr30 )

LOAD_CORR( corr31, Corr1_bf, indx3 ) /* corr00, corr31 same register */
VUPKLSB( C1_8, corr10 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )
VUPKLSB( C2_8, corr20 )
VUPKLSB( C3_8, corr30 )

VUPKHSB( C0_16, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKHSB( C2_16, corr21 )
VUPKHSB( C3_16, corr31 )

VUPKLSB( C0_24, corr01 )
VUPKLSB( C1_24, corr11 )
VUPKLSB( C2_24, corr21 )
VUPKLSB( C3_24, corr31 )

/**
   Outer loop for each physical user
**/

LABEL( oloop )
/* { */

DECR( Num_phys_users )
LBZU( ptov_count, Ptov_map, PTOV_BUMP_1 )
BEQ_CR( OLOOP_BIT, ret )
LVX( X0, 0, X_bf )
LVX( X8, X_bf, indx1 )
SRWI_C( icount, ptov_count, 1 )
LVX( X16, X_bf, indx2 )
LVX( X24, X_bf, indx3 )
ADDI( X_bf, X_bf, X_BUMP_64 )
CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
BEQ_MINUS( one_sum )



/**
   Top of sum loop
   Produces four sums each pass
**/

LABEL( iloop )
/* { */

PREFETCH( dstp, dst_code, 0 )
LOAD_CORR( corr00, 0, Corr0_bf )
DECR_C( icount )
LOAD_CORR( corr10, Corr0_bf, indx2 )
VMSUMSHS( sum0, C0_0, X0, zero )
LOAD_CORR( corr20, 0, Corr1_bf )
VMSUMSHS( sum1, C1_0, X0, zero )
LOAD_CORR( corr30, Corr1_bf, indx2 )
LOAD_CORR( corr01, Corr0_bf, indx1 )
VMSUMSHS( sum2, C2_0, X0, zero )
LOAD_CORR( corr11, Corr0_bf, indx3 )
LOAD_CORR( corr21, Corr1_bf, indx1 )
VMSUMSHS( sum3, C3_0, X0, zero )
VUPKHSB( C0_0, corr00 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
VUPKHSB( C2_0, corr20 )
VMSUMSHS( sum1, C1_8, X8, sum1 )
VUPKHSB( C3_0, corr30 )
VMSUMSHS( sum2, C2_8, X8, sum2 )
VUPKLSB( C0_8, corr00 )
VMSUMSHS( sum3, C3_8, X8, sum3 )
LOAD_CORR( corr31, Corr1_bf, indx3 ) /* corr00, corr31 same register */
VUPKLSB( C1_8, corr10 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VUPKLSB( C2_8, corr20 )
VMSUMSHS( sum1, C1_16, X16, sum1 )
VUPKLSB( C3_8, corr30 )
ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
VUPKHSB( C0_16, corr01 )
VMSUMSHS( sum2, C2_16, X16, sum2 )
VUPKHSB( C1_16, corr11 )
VMSUMSHS( sum3, C3_16, X16, sum3 )
VUPKHSB( C2_16, corr21 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )
VMSUMSHS( sum1, C1_24, X24, sum1 )
VUPKHSB( C3_16, corr31 )
VMSUMSHS( sum2, C2_24, X24, sum2 )
VUPKLSB( C0_24, corr01 )
VMSUMSHS( sum3, C3_24, X24, sum3 )
VSUMSWS( sum0, sum0, zero )
VUPKLSB( C1_24, corr11 )
VSUMSWS( sum1, sum1, zero )
VUPKLSB( C2_24, corr21 )
VSUMSWS( sum2, sum2, zero )
VUPKLSB( C3_24, corr31 )
VSPLTW( sum0, sum0, 3 )
VSUMSWS( sum3, sum3, zero )
VSPLTW( sum1, sum1, 3 )
STVEWX( sum0, 0, R_sump0 )
VSPLTW( sum2, sum2, 3 )
STVEWX( sum1, R_sump0, sindex1 )
VSPLTW( sum3, sum3, 3 )
STVEWX( sum2, 0, R_sump1 )
STVEWX( sum3, R_sump1, sindex1 )

BNE( iloop )
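
As a cross-check on the vector code above, the arithmetic behind a single sum can be sketched in scalar C. This is a minimal reference under stated assumptions, not the routine itself: the names `r_sum_ref`, `sat32` and `R_SUM_ELEMS` are illustrative, the element count of 32 is read off the four unpacked correlation vectors (offsets 0, 8, 16, 24) paired with X0, X8, X16 and X24, and saturation is applied once at the end rather than at every accumulate step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical element count: one sum covers four vectors of eight 16-bit X values. */
#define R_SUM_ELEMS 32

/* Clamp a wide accumulator to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Sign-extend each 8-bit correlation value to 16 bits, multiply it by the
   matching 16-bit X value, and accumulate. Saturating once at the end agrees
   with step-by-step saturation only when no intermediate overflow occurs. */
static int32_t r_sum_ref(const int8_t *corr, const int16_t *x)
{
    int64_t acc = 0;
    int i;

    assert(corr != 0 && x != 0);
    for (i = 0; i < R_SUM_ELEMS; i++)
        acc += (int32_t)(int16_t)corr[i] * (int32_t)x[i];
    return sat32(acc);
}
```

A reference of this kind is mainly useful for comparing results against the software-pipelined loop, where the interleaving of loads, unpacks and multiply-sums makes the data flow hard to follow by eye.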



/**
   Drop out, check for remainders
**/

ANDI_C( icount, ptov_count, 0x1 )
BEQ( oloop )

/**
   Enters and exits with two corr vectors loaded and expanded to 16 bit
**/

LABEL( one_sum )

VMSUMSHS( sum0, C0_0, X0, zero )
ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
VMSUMSHS( sum2, C2_0, X0, zero )
ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
VMSUMSHS( sum2, C2_8, X8, sum2 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum2, C2_16, X16, sum2 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VMSUMSHS( sum2, C2_24, X24, sum2 )
VSUMSWS( sum0, sum0, zero )
VSUMSWS( sum2, sum2, zero )

VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump0 )
VSPLTW( sum2, sum2, 3 )
STVEWX( sum2, 0, R_sump1 )
ADDI( R_sump0, R_sump0, -RSUM_BUMP_4 ) /* pre-dec pointers for loop reentry */
ADDI( R_sump1, R_sump1, -RSUM_BUMP_4 )

/**
   Setup for loop re-entry:
   corr00 consumed in one_sum section
   [diagram: exit ptr and re-entry ptr positions over the corr00/corr10 sequence]
**/



VMR( corr21, corr31 ) /* corr00, corr31 same register */
VMR( corr00, corr10 )
LOAD_CORR( corr10, 0, Corr0_bf )
VMR( corr01, corr11 )
LOAD_CORR( corr11, Corr0_bf, indx1 )
VMR( corr20, corr30 )
LOAD_CORR( corr30, 0, Corr1_bf )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
LOAD_CORR( corr31, Corr1_bf, indx1 ) /* corr00, corr31 same register */
VUPKHSB( C1_0, corr10 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C2_0, corr20 )
VUPKLSB( C2_8, corr20 )
VUPKHSB( C3_0, corr30 )
VUPKLSB( C3_8, corr30 )
VUPKHSB( C0_16, corr01 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_32 )
VUPKLSB( C0_24, corr01 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_32 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )
VUPKHSB( C2_16, corr21 )
VUPKLSB( C2_24, corr21 )
VUPKHSB( C3_16, corr31 )
VUPKLSB( C3_24, corr31 )

/* } */

BR( oloop )
/**
   Exit routine
**/

LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
REST_r13_r17
RETURN
FUNC_EPILOG
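
The control flow of the routine above can be summarized with a small C sketch. It reflects one reading of the listing, stated as an assumption: for each physical user, `ptov_count` entries are consumed two at a time by the four-sum loop (`iloop`), and an odd leftover entry is handled by the `one_sum` section. The structure and function names below are hypothetical.

```c
#include <assert.h>

/* Hypothetical summary of how one physical user's entries are split. */
struct pass_plan {
    int four_sum_passes; /* iterations of the iloop body (two entries each) */
    int tail_pass;       /* 1 if the one_sum section handles a leftover entry */
};

static struct pass_plan plan_passes(int ptov_count)
{
    struct pass_plan p;

    assert(ptov_count > 0);
    p.four_sum_passes = ptov_count >> 1;  /* mirrors SRWI_C( icount, ptov_count, 1 ) */
    p.tail_pass       = ptov_count & 0x1; /* mirrors ANDI_C( icount, ptov_count, 0x1 ) */
    return p;
}
```

For example, a physical user with five entries would take two passes through the four-sum loop and one pass through the tail section.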




/*
--- MC Standard Algorithms -- PPC Macro language Version ---

File Name:     GEN_X_ROW.MAC

Description:   2 complex scalars (4x1), 2 complex vectors (4xN)
               16 bit complex multiplication producing a 16
               bit complex vector of length 16*N.

Entry/params:  GEN_X_ROW (A1, A2, C, Phys_index, N)

Formula:
    for ( i = 0; i < tot_phys_users; i++ ) {
        in_mpath1p = mpath1_bf + (i * N_FINGERS_MAX);
        in_mpath2p = mpath2_bf + (i * N_FINGERS_MAX);

        for ( q1 = 0; q1 < N_FINGERS_MAX; q1++ ) {
            s1r = (BF32)out_mpath1p[q1].real;
            s1i = (BF32)out_mpath1p[q1].imag;
            s2r = (BF32)out_mpath2p[q1].real;
            s2i = (BF32)out_mpath2p[q1].imag;

            for ( q = 0; q < N_FINGERS_MAX; q++ ) {
                a1r = (BF32)in_mpath1p[q].real;
                a1i = (BF32)in_mpath1p[q].imag;
                a2r = (BF32)in_mpath2p[q].real;
                a2i = (BF32)in_mpath2p[q].imag;

                cr  = (a1r * s1r) + (a1i * s1i);
                ci  = (a1r * s1i) - (a1i * s1r);
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);

                X_bf[i * N_FINGERS_MAX_SQUARED + j].real = (BF16)(cr >> 16);
                X_bf[i * N_FINGERS_MAX_SQUARED + j].imag = (BF16)(ci >> 16);
            }
        }
    }

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000907   fpl        Created
*/

#include "salppc.inc"

#define LOG_N_FINGERS_MAX 2
#define LOG_ELEMENT_SIZE 2

#define INDEX_SHIFT (LOG_N_FINGERS_MAX + LOG_ELEMENT_SIZE)
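
The per-element arithmetic of the formula in the file header can be sketched in portable C. This is an illustrative sketch under stated assumptions: the type `cplx16` and the function `x_elem_ref` are hypothetical names, 64-bit accumulators are used here only to keep the sum of four 16x16 products free of signed overflow (the header formula itself uses BF32), and `>>` on a negative value is assumed to be an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16 bit interleaved complex element. */
typedef struct { int16_t real, imag; } cplx16;

/* One output element: two complex products accumulated, then the
   upper 16 bits of each 32-bit-scaled result are kept. */
static cplx16 x_elem_ref(cplx16 a1, cplx16 s1, cplx16 a2, cplx16 s2)
{
    int64_t cr, ci;
    cplx16 out;

    cr  = (int64_t)a1.real * s1.real + (int64_t)a1.imag * s1.imag;
    ci  = (int64_t)a1.real * s1.imag - (int64_t)a1.imag * s1.real;
    cr += (int64_t)a2.real * s2.real + (int64_t)a2.imag * s2.imag;
    ci += (int64_t)a2.real * s2.imag - (int64_t)a2.imag * s2.real;

    out.real = (int16_t)(cr >> 16); /* keep the 16 MSBs, as in (BF16)(cr >> 16) */
    out.imag = (int16_t)(ci >> 16);
    return out;
}
```

The signs of the cross terms follow the header formula term by term; no saturation is modeled, so inputs near full scale on both paths may not match a saturating vector implementation.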



/**
   Local read-only Permute vector table
**/

RODATA_SECTION( 6 )
START_L_ARRAY( local_table )

L_PERMUTE_MASK( 0x02031011, 0x06071415, 0x0a0b1819, 0x0e0f1c1d )

/* 32 -> 16 bit: select the 16 MSBs of each 32 bit field */
L_PERMUTE_MASK( 0x00011011, 0x04051415, 0x08091819, 0x0c0d1c1d )
END_ARRAY
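
To make the second table entry concrete, the following C sketch models a byte permute in scalar code. It is a simplified model, not the hardware definition: `vperm_ref` and `msb16_interleave` are hypothetical names, bytes are numbered in big-endian order, and each control byte is taken to index the 32-byte concatenation of the two source vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a byte permute: the low 5 bits of each control byte index
   into the concatenation of a (bytes 0-15) and b (bytes 16-31).
   The output buffer must not alias either input. */
static void vperm_ref(uint8_t out[16], const uint8_t a[16],
                      const uint8_t b[16], const uint8_t ctl[16])
{
    int i;

    for (i = 0; i < 16; i++) {
        uint8_t sel = (uint8_t)(ctl[i] & 0x1f);
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* The second table entry written out byte by byte: bytes 0-1 of each 32-bit
   field of a, followed by bytes 0-1 of the matching field of b. */
static const uint8_t msb16_interleave[16] = {
    0x00, 0x01, 0x10, 0x11, 0x04, 0x05, 0x14, 0x15,
    0x08, 0x09, 0x18, 0x19, 0x0c, 0x0d, 0x1c, 0x1d
};
```

With `a` holding 32-bit real accumulators and `b` holding 32-bit imaginary accumulators, this mask yields interleaved 16-bit real/imaginary pairs taken from the upper halves of each field.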
/**
   API registers
**/

#define A1 r3
#define A2 r4
#define C r5
#define Phys_index r6
#define N r7

/**
   Integer loop registers
**/

#define Cp0 C
#define Cp1 r8
#define sptr1 r8
#define Cp2 r9
#define sptr2 r9
#define Cp3 r10
#define tptr r10
#define cindex r11
#define aindex r12
#define index r12

/**
   G4 registers
**/

#define cr00 v0
#define cr01 v1
#define cr02 v2
#define cr03 v3

#define vtmp0 v0
#define vtmp2 v2

#define ci00 v4
#define ci01 v5
#define ci02 v6
#define ci03 v7

#define sr00 v8
#define sr01 v9
#define sr02 v10
#define sr03 v11

#define si00 v12
#define si01 v13
#define si02 v14
#define si03 v15

#define sr10 v16
#define sr11 v17
#define sr12 v18
#define sr13 v19

#define si10 v20
#define si11 v21
#define si12 v22
#define si13 v23

#define c0 v24
#define c1 v24
#define c2 v25
#define c3 v26

#define a00 v27
#define a10 v27
#define a01 v28
#define a11 v29

#define sval v28
#define neg_sval v29

#define vc v30
#define zero v31

/**
   Begin code text
**/

FUNC_PROLOG

ENTRY_5( gen_X_row, A1, A2, C, Phys_index, N )

USE_THRU_v31( VRSAVE_COND )

/**
   Load up complex scalar
   sval = sr0 si0 sr1 si1 sr2 si2 sr3 si3
**/

LA( tptr, local_table, 0 )
VXOR( zero, zero, zero )
LI( index, 0 )

/**
   Byte offset into 16 bit complex vector
**/

SLWI( Phys_index, Phys_index, INDEX_SHIFT )
ADD( sptr1, A1, Phys_index )
ADD( sptr2, A2, Phys_index )

/**
   Load up first scalar:
   if sval = sr0,si0 sr1,si1 sr2,si2 sr3,si3
           = s0 s1 s2 s3
**/

LVX( sval, sptr1, index )           /* read 4 16 bit complex values */
VSUBSHS( neg_sval, zero, sval )     /* negate complex scalar values */
VMRGHW( vtmp0, sval, sval )         /* vtmp0 = s0 s0 s1 s1 */
VMRGLW( vtmp2, sval, sval )         /* vtmp2 = s2 s2 s3 s3 */
VMRGHW( sr00, vtmp0, vtmp0 )        /* sr0 = s0 s0 s0 s0 */
VMRGLW( sr01, vtmp0, vtmp0 )        /* sr1 = s1 s1 s1 s1 */
VMRGHW( sr02, vtmp2, vtmp2 )        /* sr2 = s2 s2 s2 s2 */
VMRGLW( sr03, vtmp2, vtmp2 )        /* sr3 = s3 s3 s3 s3 */

/**
   if neg_sval = sr0,si0 sr1,si1 sr2,si2 sr3,si3
   after perm:
               = si0,-sr0 si1,-sr1 si2,-sr2 si3,-sr3
               = ns0 ns1 ns2 ns3
**/

LVX( vc, tptr, index )
VPERM( neg_sval, sval, neg_sval, vc )   /* si -sr */
VMRGHW( vtmp0, neg_sval, neg_sval )     /* vtmp0 = ns0 ns0 ns1 ns1 */
VMRGLW( vtmp2, neg_sval, neg_sval )     /* vtmp2 = ns2 ns2 ns3 ns3 */
VMRGHW( si00, vtmp0, vtmp0 )            /* si0 = ns0 ns0 ns0 ns0 */
VMRGLW( si01, vtmp0, vtmp0 )            /* si1 = ns1 ns1 ns1 ns1 */
VMRGHW( si02, vtmp2, vtmp2 )            /* si2 = ns2 ns2 ns2 ns2 */
VMRGLW( si03, vtmp2, vtmp2 )            /* si3 = ns3 ns3 ns3 ns3 */

/**
   Load up second scalar:
**/

LVX( sval, sptr2, index )           /* read 4 16 bit complex values */
VSUBSHS( neg_sval, zero, sval )     /* negate complex scalar values */
VMRGHW( vtmp0, sval, sval )         /* vtmp0 = s0 s0 s1 s1 */
VMRGLW( vtmp2, sval, sval )         /* vtmp2 = s2 s2 s3 s3 */
VMRGHW( sr10, vtmp0, vtmp0 )        /* sr0 = s0 s0 s0 s0 */
VMRGLW( sr11, vtmp0, vtmp0 )        /* sr1 = s1 s1 s1 s1 */
VMRGHW( sr12, vtmp2, vtmp2 )        /* sr2 = s2 s2 s2 s2 */
VMRGLW( sr13, vtmp2, vtmp2 )        /* sr3 = s3 s3 s3 s3 */

VPERM( neg_sval, sval, neg_sval, vc )   /* si -sr */
VMRGHW( vtmp0, neg_sval, neg_sval )     /* vtmp0 = ns0 ns0 ns1 ns1 */
VMRGLW( vtmp2, neg_sval, neg_sval )     /* vtmp2 = ns2 ns2 ns3 ns3 */
VMRGHW( si10, vtmp0, vtmp0 )            /* si0 = ns0 ns0 ns0 ns0 */
VMRGLW( si11, vtmp0, vtmp0 )            /* si1 = ns1 ns1 ns1 ns1 */
VMRGHW( si12, vtmp2, vtmp2 )            /* si2 = ns2 ns2 ns2 ns2 */
VMRGLW( si13, vtmp2, vtmp2 )            /* si3 = ns3 ns3 ns3 ns3 */

/**
   Assign loop pointers and index registers:
   Loop permute control vector assumes 16 bit input vectors
   C[] -> 16 x N complex elements
   A[] -> 4 x N complex elements
   N -> 4 byte (i.e. interleaved complex) elements
**/

LVX( vc, tptr, index )   /* interleaves 16 MSBs of real, imaginary */
LI( aindex, 0 )
LI( cindex, 0 )
ADDI( Cp1, C, 16 )
ADDI( Cp2, C, 32 )
ADDI( Cp3, C, 48 )

/**
   Start up loop code:
   Each read on A[] brings in 4 complex input values
**/

LVX( a00, A1, aindex )
DECR_C( N )
LVX( a01, A2, aindex )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
BEQ( do1 )

DECR_C( N )
LVX( a10, A1, aindex )   /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
BR( mid_loop0 )

/**
   Top of double loop
**/

LABEL( loop0 )

VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
DECR_C( N )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
LVX( a10, A1, aindex )   /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
LABEL( mid_loop0 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )   /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )       /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )

/* } */

BNE( loop1 )

/**
   Drop out to flush
**/

VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc )   /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )       /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )

VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )

/**
   Top of second loop
**/

LABEL( loop1 )
/* { */

VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
DECR_C( N )
VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )
LVX( a00, A1, aindex )   /* read input for next pass */
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
LVX( a01, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc )   /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )       /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )

/* } */

BNE( loop0 )

/**
   Flush loop
**/

VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )   /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )       /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )

VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )

LABEL( do1 )
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
VMSUMSHS( cr01, sr11, a01, cr01 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )   /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )       /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
/**
   Return
**/

LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
RETURN
FUNC_EPILOG

#include "mudlib.h"

/*
 * Return the offset in units of complex elements into the Corr0 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */
int mudlib_get_Corr0_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                         start_phys_user, start_virt_user ) - 1;

    num_Corrs = (num_virt_users * tot_virt_users) -
                ( (num_virt_users * (num_virt_users + 1)) / 2 );

    return ( num_Corrs * (num_fingers * num_fingers) );
}
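
The Corr0 offset arithmetic can be checked with a short, self-contained C sketch. It assumes, as an interpretation rather than something the listing states, that Corr0 holds one block of `num_fingers * num_fingers` complex elements for every pair of distinct virtual users, so that the k-th virtual user contributes `tot_virt_users - 1 - k` blocks. The function names are hypothetical, and `skipped` stands for the number of virtual users preceding the starting pair.

```c
#include <assert.h>

/* Row-by-row form: skipped rows of tot_virt_users columns, minus the
   triangular part that is not stored. */
static int corr0_offset_rows(int skipped, int tot_virt_users, int num_fingers)
{
    int num_Corrs = (skipped * tot_virt_users) - ((skipped * (skipped + 1)) / 2);

    assert(skipped >= 0 && skipped <= tot_virt_users);
    return num_Corrs * (num_fingers * num_fingers);
}

/* Pair-counting form: all ordered pairs minus the ordered pairs among the
   users that remain; both products are even, so the halving is exact. */
static int corr0_offset_pairs(int skipped, int tot_virt_users, int num_fingers)
{
    int remaining      = tot_virt_users - skipped;
    int total_size     = tot_virt_users * (tot_virt_users - 1);
    int remaining_size = remaining * (remaining - 1);

    return (num_fingers * num_fingers) * ((total_size - remaining_size) >> 1);
}
```

Both forms reduce to `skipped * tot_virt_users - skipped * (skipped + 1) / 2` blocks, so they should agree for every valid `skipped`.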

/*
 * Return the size (in bytes) of the portion of the Corr0 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of Corr0 are assumed
 * to be of type COMPLEX_BF8.
 */
int mudlib_get_Corr0_size (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_Corr0_offset( ptov_map,
                                            num_fingers,
                                            tot_virt_users,
                                            start_phys_user,
                                            start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_Corr0_offset( ptov_map,
                                          num_fingers,
                                          tot_virt_users,
                                          end_phys_user,
                                          end_virt_user );

    return ( (end_offset - start_offset) * sizeof(COMPLEX_BF8) );
}

/*
 * Return the offset in units of complex elements into the Corr1 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */
int mudlib_get_Corr1_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                         start_phys_user, start_virt_user ) - 1;

    num_Corrs = (num_virt_users * tot_virt_users);
    return ( num_Corrs * (num_fingers * num_fingers) );
}

/*
 * Return the size (in bytes) of the portion of the Corr1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of Corr1 are assumed
 * to be of type COMPLEX_BF8.
 */
int mudlib_get_Corr1_size (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_Corr1_offset( ptov_map,
                                            num_fingers,
                                            tot_virt_users,
                                            start_phys_user,
                                            start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_Corr1_offset( ptov_map,
                                          num_fingers,
                                          tot_virt_users,
                                          end_phys_user,
                                          end_virt_user );

    return ( (end_offset - start_offset) * sizeof(COMPLEX_BF8) );
}



/*
 * Return the offset into the R0 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */
int mudlib_get_R0_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_virt_users, offset, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                         start_phys_user, start_virt_user ) - 1;

    offset = 0;
    for ( i = 0; i < num_virt_users; i++ )
        offset += (tcols - (i & ~R_MATRIX_ALIGN_MASK));
    return offset;
}
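
The row-length rule used for R0 can be illustrated with a small C sketch. The alignment mask value below (15) is purely an assumption for the example, since `R_MATRIX_ALIGN_MASK` is defined outside this listing, and the function names are hypothetical. The sketch rounds the column count up to a multiple of the alignment and shortens each row by the aligned-down row index, which appears to be how the triangular R0 storage keeps rows aligned.

```c
#include <assert.h>

/* Hypothetical mask value; the real R_MATRIX_ALIGN_MASK is defined elsewhere. */
#define ALIGN_MASK_ASSUMED 15

/* Length of one stored row: padded column count minus the aligned-down row index. */
static int r0_row_len(int row, int tot_virt_users)
{
    int tcols = (tot_virt_users + ALIGN_MASK_ASSUMED) & ~ALIGN_MASK_ASSUMED;

    return tcols - (row & ~ALIGN_MASK_ASSUMED);
}

/* Offset of a row: the sum of the lengths of all rows stored before it. */
static int r0_offset_ref(int rows_skipped, int tot_virt_users)
{
    int i, offset = 0;

    assert(rows_skipped >= 0);
    for (i = 0; i < rows_skipped; i++)
        offset += r0_row_len(i, tot_virt_users);
    return offset;
}
```

With 20 virtual users and the assumed mask, rows 0 through 15 each occupy 32 elements and rows 16 through 19 each occupy 16.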



/*
 * Return the size (in bytes) of the portion of the R0 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R0 are assumed
 * to be of type BF8.
 */
int mudlib_get_R0_size (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R0_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R0_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}



/*
 * Return the offset into the R1 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */
int mudlib_get_R1_offset (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int num_virt_users, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                         start_phys_user, start_virt_user ) - 1;

    return ( num_virt_users * tcols );
}

/*
 * Return the size (in bytes) of the portion of the R1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R1 are assumed
 * to be of type BF8.
 */
int mudlib_get_R1_size (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R1_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R1_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}

/*
 * Return the number of virtual users
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive.
 */
int mudlib_get_num_virt_users (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int i, num_virt_users;

    if ( start_phys_user == end_phys_user )
        return ( end_virt_user - start_virt_user + 1 );
    else {
        num_virt_users = ptov_map[start_phys_user] - start_virt_user;
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            num_virt_users += ptov_map[i];
        num_virt_users += (end_virt_user + 1);
        return ( num_virt_users );
    }
}
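
A worked example may help with the inclusive counting convention. The sketch below restates the same counting rule in a standalone function (the name `count_virt_users` is hypothetical, chosen so that it does not clash with the library routine) and applies it to a small, made-up `ptov_map`.

```c
#include <assert.h>

/* Count virtual users from (start_phys, start_virt) through
   (end_phys, end_virt), both ends inclusive. */
static int count_virt_users(const unsigned char *ptov_map,
                            int start_phys, int start_virt,
                            int end_phys, int end_virt)
{
    int i, n;

    assert(start_phys <= end_phys);
    if (start_phys == end_phys)
        return end_virt - start_virt + 1;

    n = ptov_map[start_phys] - start_virt;   /* rest of the first physical user */
    for (i = start_phys + 1; i < end_phys; i++)
        n += ptov_map[i];                    /* whole physical users in between */
    n += end_virt + 1;                       /* leading part of the last one */
    return n;
}
```

For a map of {2, 3, 1}, counting from pair (0, 1) through pair (2, 0) covers (0,1), (1,0), (1,1), (1,2) and (2,0), that is, five virtual users.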



/*
 * For a specified starting physical user, virtual user
 * (within the starting physical user) pair and a specified
 * number of virtual users inclusive of the starting pair,
 * return (in separate arguments), the corresponding ending
 * physical user, virtual user pair (inclusive).
 */
void mudlib_get_end_user_pair (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int num_virt_users,         /* number from start (must be > 0) */
    int *end_phys_user,         /* zero-based index into ptov_map */
    int *end_virt_user          /* will be < ptov_map[*end_phys_user] */
)
{
    int i, j;

    for ( i = start_phys_user; ; i++ ) {
        for ( j = start_virt_user; j < ptov_map[i]; j++ )
            if ( --num_virt_users == 0 ) break;
        if ( num_virt_users == 0 ) break;
        start_virt_user = 0;
    }

    *end_phys_user = i;
    *end_virt_user = j;
}
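
The walk performed by this routine can be exercised with a standalone C sketch. The function name `find_end_pair` and the sample map are hypothetical; the loop structure follows the listing. As in the original, the sketch assumes the requested count does not run past the end of the map, since nothing in the loop bounds the physical-user index.

```c
#include <assert.h>

/* Advance from (start_phys, start_virt) by num_virt_users entries,
   counting the starting pair, and report the final pair. */
static void find_end_pair(const unsigned char *ptov_map,
                          int start_phys, int start_virt,
                          int num_virt_users,
                          int *end_phys, int *end_virt)
{
    int i, j;

    assert(num_virt_users > 0);
    for (i = start_phys; ; i++) {
        for (j = start_virt; j < ptov_map[i]; j++)
            if (--num_virt_users == 0) break;   /* j stays on the ending user */
        if (num_virt_users == 0) break;
        start_virt = 0;                         /* later physical users start at 0 */
    }
    *end_phys = i;
    *end_virt = j;
}
```

For a map of {2, 3, 1}, starting at pair (0, 1) and asking for five virtual users ends at pair (2, 0), which is the inverse of counting the users between those two pairs.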
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#include "mudlib. h" 

/******************************************************************************/

int mudlib_get_Corr0_offset_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_fingers_squared, remaining_size, skipped_virt_users,
        total_size;

    num_fingers_squared = num_fingers * num_fingers;
    skipped_virt_users = 0;

    for ( i = 0; i < start_phys_user; i++ )
        skipped_virt_users += (int)ptov_map[i];

    skipped_virt_users += start_virt_user;

    // Always even
    total_size = tot_virt_users * ( tot_virt_users - 1 );
    remaining_size = ( tot_virt_users - skipped_virt_users )
                   * ( tot_virt_users - skipped_virt_users - 1 );

    // zero based units of complex elements
    return ( num_fingers_squared * ( ( total_size - remaining_size ) >> 1 ) );
}

int mudlib_get_Corr1_offset_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_fingers_squared, skipped_virt_users;

    num_fingers_squared = num_fingers * num_fingers;
    skipped_virt_users = 0;

    for ( i = 0; i < start_phys_user; i++ )
        skipped_virt_users += (int)ptov_map[i];

    skipped_virt_users += start_virt_user;

    return ( num_fingers_squared * ( skipped_virt_users * tot_virt_users ) );
}

int mudlib_get_R0_offset_v (
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    R0_skipped_virt_users = 0;
    size = 0;

    for ( i = 0; i < start_phys_user; i++ ) {
        for ( iv = 0; iv < (int)ptov_map[i]; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            ++R0_skipped_virt_users;
        }
    }

    /* Handle last physical user, potentially split on virt users */
    for ( iv = 0; iv < (int)start_virt_user; iv++ ) {
        R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
        size += R0_tcols;
        ++R0_skipped_virt_users;
    }

    return size;
}

int mudlib_get_R0_size_v(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    R0_skipped_virt_users = 0;
    for ( i = 0; i < start_phys_user; i++ )
        R0_skipped_virt_users += (int)ptov_map[i];
    R0_skipped_virt_users += start_virt_user;
    // printf( "skipped: %d\n", R0_skipped_virt_users );

    size = 0;

    if ( start_phys_user == end_phys_user ) {
        // printf( "start == end phys\n" );
        // <= for Inclusive...
        for ( iv = start_virt_user; iv <= (int)end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf( "size: %d, R0tc: %d\n", size, R0_tcols );
            ++R0_skipped_virt_users;
        }
    }
    else {
        for ( i = start_phys_user; i < end_phys_user; i++ ) {
            for ( iv = 0; iv < (int)ptov_map[i]; iv++ ) {
                R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
                size += R0_tcols;
                // printf( "size: %d, R0tc: %d\n", size, R0_tcols );
                ++R0_skipped_virt_users;
            }
        }

        /* Handle last physical user, potentially split on virt users */
        // printf( "last phys user\n" );
        // <= for Inclusive...
        for ( iv = start_virt_user; iv <= (int)end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf( "size: %d, R0tc: %d\n", size, R0_tcols );
            ++R0_skipped_virt_users;
        }
    }

    return size;
}
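
The R0 size routines above accumulate, row by row, a stored width of tcols minus the row index rounded down to the R-matrix alignment. The following is a minimal sketch of that arithmetic only, not the library routine itself. The helper names r0_row_cols and r0_rows_size are hypothetical and introduced purely for illustration; the alignment constants mirror those in mudlib.h.

```c
#include <assert.h>

#define R_MATRIX_ALIGN_LOG  5
#define R_MATRIX_ALIGN      (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK (R_MATRIX_ALIGN - 1)

/* Hypothetical helper: stored width of one R0 row. The row index is
 * rounded down to a multiple of R_MATRIX_ALIGN and subtracted from the
 * padded column count, as in the loops of mudlib_get_R0_size_v. */
int r0_row_cols(int tcols, int row)
{
    assert(tcols >= 0 && (tcols & R_MATRIX_ALIGN_MASK) == 0);
    assert(row >= 0 && row < tcols);
    return tcols - (row & ~R_MATRIX_ALIGN_MASK);
}

/* Hypothetical helper: total element count for rows first..last,
 * inclusive on both ends. */
int r0_rows_size(int tcols, int first, int last)
{
    int row, size = 0;

    for (row = first; row <= last; row++)
        size += r0_row_cols(tcols, row);
    return size;
}
```

With tcols = 64, rows 0 through 31 each contribute 64 elements and rows 32 through 63 each contribute 32, so the stored width steps down once per alignment block rather than once per row.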



int mudlib_get_R1_offset_v(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;

    // Main loop
    for ( i = 0; i < start_phys_user; i++ ) {
        virt_users += (int)ptov_map[i];
    }

    // Trailing virtual users
    virt_users += start_virt_user;

    return ( virt_users * tcols );
}



int mudlib_get_R1_size_v(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;

    if ( start_phys_user == end_phys_user ) {
        virt_users = end_virt_user - start_virt_user + 1;
    }
    else if ( start_phys_user < end_phys_user ) {
        // Leading virtual users
        virt_users = (int)ptov_map[start_phys_user] - start_virt_user;

        // Main loop
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            virt_users += (int)ptov_map[i];

        // Trailing virtual users
        virt_users += (end_virt_user + 1);
    }

    return ( virt_users * tcols );
}
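
The R1 offset and size routines share two pieces of arithmetic: rounding the total virtual-user count up to the R-matrix alignment, and counting virtual users across an inclusive (physical, virtual) range. A minimal sketch of both follows, assuming the alignment constants from mudlib.h. The names round_up_cols and count_virt_users are hypothetical and do not appear in the listing.

```c
#include <assert.h>

#define R_MATRIX_ALIGN_LOG  5
#define R_MATRIX_ALIGN      (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK (R_MATRIX_ALIGN - 1)

/* Hypothetical helper: round a column count up to the next multiple
 * of R_MATRIX_ALIGN (the "tcols" computation in the listing). */
int round_up_cols(int tot_virt_users)
{
    assert(tot_virt_users >= 0);
    return (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
}

/* Hypothetical helper: number of virtual users in the inclusive range
 * (start_phys, start_virt) .. (end_phys, end_virt), following the
 * leading / main loop / trailing split used by mudlib_get_R1_size_v. */
int count_virt_users(const unsigned char *ptov_map,
                     int start_phys, int start_virt,
                     int end_phys, int end_virt)
{
    int i, n = 0;

    if (start_phys == end_phys) {
        n = end_virt - start_virt + 1;
    } else if (start_phys < end_phys) {
        n = (int)ptov_map[start_phys] - start_virt;   /* leading  */
        for (i = start_phys + 1; i < end_phys; i++)   /* main     */
            n += (int)ptov_map[i];
        n += end_virt + 1;                            /* trailing */
    }
    return n;
}
```

For a map of {2, 3, 1} virtual users per physical user, the range from (0, 1) to (2, 0) covers 1 + 3 + 1 = 5 virtual users, and six total virtual users pad to 32 columns, so the corresponding R1 block would occupy 5 * 32 elements under these assumptions.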
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tfdefine 10 1 
#define TIME 0

//
// Asynchronous MPIC
//

#if TIME
#include <tmr.h>
#endif

#include "mudlib.h"

void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n );

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

#if TIME
static int time_count = 0;
static int z;
static float time;
static TMR_ts time0, time1;
static TMR_timespec elapsed;
#endif

/*
 * void async_multirate_mpic( BF8 *Bt_hat, BF8 *R0_hat,
 *                            BF8 *R1_hat, BF8 *R1m_hat,
 *                            BF32 *Y, BF32 Ythresh,
 *                            int N_users, int N_bits, int N_stages )
 *
 * N_users must be > 0 and divisible by 4
 * N_bits must be >= 5
 */

void mudlib_mpic( BF8 *Bt_hat,
                  BF8 *R0_hat,
                  BF8 *R1_hat,
                  BF8 *R1m_hat,
                  BF32 *Y,
                  BF32 Ythresh,
                  int N_users,
                  int N_bits,
                  int N_stages )
{
    BF8 *Bt_hatp;
    BF8 *R0_hatp, *R1_hatp, *R1m_hatp;
    BF32 *Yp;
    BF32 R_bias, sums[3];
    int hat_tc, i, m, N_users_pad, stage;

    hat_tc = (N_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    N_users_pad = (N_users + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

#if 0
    if ( ( (long)Bt_hat | (long)R0_upper_bf | (long)R0_lower_bf |
           (long)R1_trans_bf | (long)R1m_bf ) & ALTIVEC_ALIGN_MASK ) {
        printf( "***** inputs are NON-ALIGNED *****\n" );
        exit( -1 );
    }
#endif

    //
    // Subtract interference in N_stages
    //

for ( stage = 0; stage < N_stages; stage++ ) [ 

RO hatp = RO hat; 
Rl hatp = Rl hat; 
Rim hatp = Rlm_hat; 
Yp = Y; 

for ( i = 0; i < N_users; i++ ) { 

sve3_8bit( R0_hatp, Rljiatp, Rlmjiatp, &R_bias, N_usersj?ad ); 

#lf ° R0_hatp[i] = BF8_ZERO; /* zero diagonal element */ 

#endif 

Bt hatp = Bt_hat + hat_tc ; /* points to leading row */ 

m = 2; 

while ( m < (N bits-4) ) '{ 

if ( BFABS { Yp[m} ) < Ythresh ) { 

if ( BFABS ( Yp[m+1] ) < Ythresh ) { 
if ( BFABS ( Yp[m+2] ) < Ythresh ) { 

dotpr9_8bit( Bt hatp, Rl hatp, RO hatp, Rlmjiatp, 

sums, N_users_pad, hat_tc ) ; 
sums[0] -= R bias; 

sums[l] -= ( (BF32) Bt hatp [hat tc + i] * (BF32) Rljiatp [l] ) ; 
if ( <Yp[m] - sums[0]) > BF32 ZERO ) 
Bt_hatp [hat_tc + i] = 1 + BIAS_8BIT; 

S Bt hatp [hat tc + i] = -1 + BIAS 8BIT; 

sumsll] += ( (BF32)Bt_hatp[hat_tc + i] * (BF32) Rl_hatp [l] ) ; 

sumsfl] -= R bias; r ._. 
sums[2] -= ( (BF32) Bt hatp[2*hat tc + i] * (BF32) Rl_hatp [i] ) ; 
if ( (Yp[m+1] - sums [1]) > BF32 ZERO ) 

Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT; 
else 

Bt hatp[2*hat tC + i] = -1 + BIAS 8BIT; 
sums [2] += { (BF32)Bt_hatp[2*hat_tc + i] * (BF32) Rl_hatp [l] ) ; 

sums [2] -= R bias; 

if ( (Yp[m+2] - sums [2]) > BF32 ZERO ) 

Bt_hatp[3*hat_tc + i] = 1 + BIAS_8BIT; 
else 

Bt_hatp[3*hat_tc + i] = -1 + BIAS_8BIT; 

\l se { /* skip third sum */ 

dotpr6_8bit { Bt hatp, Rl hatp, RO hatp, Rlm_hatp, 

sums, N_users_pad, hat_tc ) ; 
sums[0] -= R bias; , r ._. 

sums EU -= ( (BF32) Bt hatp [hat tc + i] * (BF32) Rljiatp [l] ) ; 
if ( (Yp[m] - sumsEO]) > BF32 ZERO ) 

Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT; 
else 

Bt hatpthat tc + i] - -1 + BIAS 8BIT; 
sums[l] += ( (BF32)Bt_hatp[hat_tc + i] * (BF32) Rljiatp [l ] ) ; 

sumstl] -= R bias; 

if ( {Yp[m+1] - sumstl]) > BF32 ZERO ) 

Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT; 
else 
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ttif 10 
tfendif 



} 



Bt_hatp[2*hat_tc + i] = -1 + BIAS_8BIT; 



Bt_hatp += hat_tc; 
++m; 



/* bump leading row pointer */ 
/* bump row */ 



#if 10 
#endif 

#if 10 
#endif 



else { /* skip second sum */ 

dotpr3_8bit( Bt hatp, Rl hatp, RO hatp, Rlm_hatp, 

sums, N_users_pad, hat_tc ) ; 
sums[0] - = R bias; 

if ( (Yp[m] - sums[0]) > BF32 ZERO ) 

Bt_hatp [hat_tC + i] = 1 + BIAS_8BIT; 
else 

Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT; 

} 



} 



Bt_hatp += hat_tc; 
+ +m; 



} 



Bt_hatp += hat_tc; 
++m; 



/+ bump leading row pointer */ 
/* bump row */ 

/* bump leading row pointer */ 
/* bump row */ 



f * 

* do last 0, 1 or 2 dot product calculations 



*/ 



\ 



while ( m < (N bits-2) 

if ( BFABS ( Yp[m] ) < Ythresh ) { 

dotpr3_8bit( Bt hatp, Rl hatp, RO hatp, Rlm_hatp, 

sums, N_users_pad, hat__tc ) ; 
sums[0] -= R bias; 

if ( (Yp[m] - sums[0]) > BF32 ZERO ) 

Bt_hatp[hat__tc + i] = 1 + B1AS_8BIT; 
else 

Bt hatp[hat_tc + i] « -1 + BIAS_8BIT; 

} 



#if io 
#endif 

} 

#if 10 



Bt_hatp += hat_tc; 
++m; 



RO hatp + = hat tc; 
Rl hatp += hat tc; 
Rim hatp + = hat_tc; 
Yp += Njbits; 



ttendif 

/ 

#if defined( COMPILE_C )

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{



/* bump leading row pointer */ 



/* bump pointer */ 
/* bump pointer */ 
/* bump pointer */ 
/* bump pointer */ 

/* end of loop over N users */ 
/* end of loop over N_stages */ 



    int j;

    sums[0] = BF32_ZERO;
    for ( j = 0; j < N; j++ ) {



mpic.c 2/23/2001

        sums[0] += (BF32)A[j] * (BF32)B0[j];
        sums[0] += (BF32)A[tcols+j] * (BF32)B1[j];
        sums[0] += (BF32)A[(tcols<<1)+j] * (BF32)B2[j];
    }
}

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 2; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 3; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}



void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n )
{
    int i;
    BF32 wsum;

    wsum = 0;
    for ( i = 0; i < n; i++ ) {
        wsum += (BF32)A[i];
        wsum += (BF32)B[i];
        wsum += (BF32)C[i];
    }

    *sum = wsum;
}

#endif
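
Each branch of the stage loop in mudlib_mpic applies the same per-bit step: when the soft value Y[m] falls below Ythresh in magnitude, the interference estimate is corrected by R_bias and the bit is re-decided as +1 or -1, stored with the BIAS_8BIT offset. The sketch below isolates that step under stated assumptions. The function name mpic_decide_bit and its scalar arguments are hypothetical; the reading of R_bias as compensation for the +1 storage bias (the row sum computed by sve3_8bit) is an interpretation of the listing, not something it states.

```c
#include <assert.h>
#include <stdlib.h>

#define BIAS_8BIT 1     /* same value as in mudlib.h */

/* Hypothetical scalar model of one bit decision.
 *   y            soft matched-filter output for this bit
 *   interference dot-product estimate computed from biased bit values
 *   r_bias       sum of the R-matrix row elements
 *   y_thresh     confidence threshold
 *   old_bit      current biased estimate (0 or 2 when BIAS_8BIT is 1)
 * Returns the new biased bit estimate. */
signed char mpic_decide_bit(long y, long interference, long r_bias,
                            long y_thresh, signed char old_bit)
{
    assert(y_thresh >= 0);

    if (labs(y) >= y_thresh)
        return old_bit;             /* confident: leave estimate as is */

    interference -= r_bias;         /* undo the effect of the +1 bias  */
    if ((y - interference) > 0)
        return (signed char)(1 + BIAS_8BIT);
    return (signed char)(-1 + BIAS_8BIT);
}
```

Skipping confident bits is what lets the listing choose among dotpr9_8bit, dotpr6_8bit and dotpr3_8bit: only the dot products for bits that will actually be re-decided are computed.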
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--- MC Standard Algorithms -- PPC Macro language Version 



File Name:    GEN_R_MATRICES.MAC

Description:  Float and scale R matrix values, convert to byte.

Entry/params: GEN_R_MATRICES( Rsump, Bf_scalep, Inv_scalep,
                              Scalep, No_scale_row_bfp,
                              Scale_row_bfp, Num_virt_users )

Formula:
              bf_scale = *bf_scalep;
              inv_scale = *inv_scalep;

              for ( i = 0; i < num_virt_users; i++ ) {
                  scale = scalep[i];
                  fsum = (float)(R_sums[i]);
                  fsum *= bf_scale;

                  fsum_scale = fsum * inv_scale;
                  fsum_scale *= scale;

                  SATURATE( fsum_scale )
                  SATURATE( fsum )

                  no_scale_row_bfp[i] = BF8_FIX( fsum );
                  scale_row_bfp[i] = BF8_FIX( fsum_scale );
              }

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000910   fpl        Created
0.1        000914   fpl        Removed VMAXFP and added windin code
0.3        000920   fpl        Removed all windin and windout



/ 
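
The Formula in the header comment above can be written out as a scalar reference loop. The following is a sketch under stated assumptions, not the vectorized routine: SATURATE and BF8_FIX are not defined anywhere in this listing, so the sketch assumes a clamp to the signed 8-bit range and round-half-away-from-zero conversion, which may differ from what the AltiVec code produces. The names gen_r_matrices_ref, saturate_bf8 and bf8_fix are hypothetical.

```c
#include <assert.h>

typedef signed char BF8;

/* Assumed behavior of SATURATE: clamp to the signed 8-bit range. */
float saturate_bf8(float x)
{
    if (x > 127.0f)
        return 127.0f;
    if (x < -128.0f)
        return -128.0f;
    return x;
}

/* Assumed behavior of BF8_FIX: round half away from zero, then
 * truncate to a byte. The input must already be saturated. */
BF8 bf8_fix(float x)
{
    assert(x >= -128.0f && x <= 127.0f);
    return (BF8)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* Hypothetical scalar reference for the header formula. */
void gen_r_matrices_ref(const int *r_sums, float bf_scale, float inv_scale,
                        const float *scalep, BF8 *no_scale_row_bfp,
                        BF8 *scale_row_bfp, int num_virt_users)
{
    int i;

    for (i = 0; i < num_virt_users; i++) {
        float fsum = (float)r_sums[i] * bf_scale;
        /* The scaled value is derived before either value is saturated. */
        float fsum_scale = fsum * inv_scale * scalep[i];

        no_scale_row_bfp[i] = bf8_fix(saturate_bf8(fsum));
        scale_row_bfp[i] = bf8_fix(saturate_bf8(fsum_scale));
    }
}
```

As in the formula, the scaled value is computed from the unsaturated fsum, so a row entry that saturates in the unscaled output can still be in range in the scaled output.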



#include "salppc.inc"

#define DO_IO 1
#if DO_IO
#define SCALE_BUMP_16 16
#else
#define SCALE_BUMP_16 0
#endif

#define STORE_SCALE( vS, rA, rB ) STVX( vS, rA, rB )

#define ZERO^COND 6 

RODATA_SECTION( 6 )

START_L_ARRAY( local_table )
/**
  First stage for byte pack
 **/
L_PERMUTE_MASK( 0x0004080c, 0x1014181c, 0x0004080c, 0x1014181c )
/**
  Second stage for byte pack
 **/
L_PERMUTE_MASK( 0x00010203, 0x04050607, 0x10111213, 0x14151617 )
END_ARRAY

/**
  Input parameters
 **/
#define Rsump               r3
#define Bf_scalep           r4
#define Inv_scalep          r5
#define Scalep              r6
#define No_scale_row_bfp    r7
#define Scale_row_bfp       r8
#define Num_virt_users      r9

/**
  Local GPRs
 **/
#define indx1               r10
#define indx2               r11
#define indx3               r12
#define low4                r0
#define tptr                indx2
#define low4x4              low4



/**
  G4 registers
 **/
#define zero                v0
#define inv_scale           v1
#define bf_scale            v2
#define byte_pack           v3
#define byte_merge          v4
#define scale0              v5
#define scale1              v6
#define vtmp                scale1
#define scale2              v7
#define vtmp2               scale2
#define scale3              v8
#define fsum0               v9
#define fsum1               v10
#define fsum2               v11
#define fsum3               v12
#define fsum_scale0         v13
#define fsum_scale1         v14
#define fsum_scale2         v15
#define fsum_scale3         v16
#define bsum0               v17
#define bsum1               v18
#define bsum2               v19
#define bsum3               v20
#define bsum_scale0         v21
#define bsum_scale1         v22
#define bsum_scale2         v23
#define bsum_scale3         v24
#define bvector             v25
#define bscale_vector       v26
#define rsum0               v27
#define rsum1               v28
#define rsum2               v29
#define rsum3               v30
#define seven               v31



/**
  Begin code text
 **/
FUNC_PROLOG
ENTRY_7( gen_R_matrices, Rsump, Bf_scalep, Inv_scalep, Scalep, \
         No_scale_row_bfp, Scale_row_bfp, Num_virt_users )
    CMPWI( Num_virt_users, 0 )
    BGT( start )
    RETURN

LABEL( start )
    USE_THRU_v31( VRSAVE_COND )
/**
  Load up permute vectors and loop scalers
 **/
    LA( tptr, local_table, 0 )
    LI( indx1, 16 )
    LVX( byte_pack, 0, tptr )
    VSPLTISB( seven, 7 )
    LVX( byte_merge, tptr, indx1 )
    SCALAR_SPLAT( bf_scale, vtmp, Bf_scalep )
    SCALAR_SPLAT( inv_scale, vtmp, Inv_scalep )
/**
  Back up to nearest 16-byte boundary. It's okay to write before and after to
  nearest 16-byte boundary in both directions.
 **/
    RLWINM( low4, No_scale_row_bfp, 0, 28, 31 )    /* lower 4 bits */
    VXOR( zero, zero, zero )
    ADD( Num_virt_users, Num_virt_users, low4 )
    SUB( No_scale_row_bfp, No_scale_row_bfp, low4 )
    SUB( Scale_row_bfp, Scale_row_bfp, low4 )
    SLWI( low4x4, low4, 2 )
    LI( indx2, 32 )
    SUB( Rsump, Rsump, low4x4 )
/**
  Start up loop
 **/
    LVX( rsum0, 0, Rsump )
    LI( indx3, 48 )
    LVX( rsum1, Rsump, indx1 )
    SUB( Scalep, Scalep, low4x4 )
    LVX( rsum2, Rsump, indx2 )
    VCFSX( fsum0, rsum0, 0 )
    LVX( rsum3, Rsump, indx3 )
    VCFSX( fsum1, rsum1, 0 )
    LVX( scale0, 0, Scalep )
    VCFSX( fsum2, rsum2, 0 )
    LVX( scale1, Scalep, indx1 )
    VCFSX( fsum3, rsum3, 0 )
    LVX( scale2, Scalep, indx2 )
    VMADDFP( fsum0, fsum0, bf_scale, zero )
    LVX( scale3, Scalep, indx3 )
    VMADDFP( fsum1, fsum1, bf_scale, zero )
    ADDIC_C( Num_virt_users, Num_virt_users, -16 )
    VMADDFP( fsum2, fsum2, bf_scale, zero )
    VMADDFP( fsum3, fsum3, bf_scale, zero )
    VMADDFP( fsum_scale0, fsum0, inv_scale, zero )
    VMADDFP( fsum_scale1, fsum1, inv_scale, zero )
    VMADDFP( fsum_scale2, fsum2, inv_scale, zero )
    ADDI( Rsump, Rsump, 64 )
    VMADDFP( fsum_scale3, fsum3, inv_scale, zero )
    ADDI( Scalep, Scalep, 64 )
    VMADDFP( fsum_scale0, fsum_scale0, scale0, zero )
    VMADDFP( fsum_scale1, fsum_scale1, scale1, zero )
    VMADDFP( fsum_scale2, fsum_scale2, scale2, zero )
    VMADDFP( fsum_scale3, fsum_scale3, scale3, zero )
    BLE( sixteen_sums )

    LVX( rsum0, 0, Rsump )
    LVX( rsum1, Rsump, indx1 )
    VCTSXS( bsum0, fsum0, 24 )
    LVX( rsum2, Rsump, indx2 )
    VCTSXS( bsum1, fsum1, 24 )
    VCTSXS( bsum2, fsum2, 24 )
    LVX( rsum3, Rsump, indx3 )
    ADDI( Rsump, Rsump, 64 )
    VCTSXS( bsum3, fsum3, 24 )
    LVX( scale0, 0, Scalep )
    VCTSXS( bsum_scale0, fsum_scale0, 24 )
    VCTSXS( bsum_scale1, fsum_scale1, 24 )
    LVX( scale1, Scalep, indx1 )
    VCTSXS( bsum_scale2, fsum_scale2, 24 )
    LVX( scale2, Scalep, indx2 )
    ADDI( No_scale_row_bfp, No_scale_row_bfp, -SCALE_BUMP_16 )
    VCTSXS( bsum_scale3, fsum_scale3, 24 )
    ADDI( Scale_row_bfp, Scale_row_bfp, -SCALE_BUMP_16 )
    BR( mloop )

/**
  Top of loop outputs 32 bytes per trip
 **/
LABEL( loop )
/* { */
    STORE_SCALE( bvector, 0, No_scale_row_bfp )
    VCTSXS( bsum_scale3, fsum_scale3, 24 )
    STORE_SCALE( bscale_vector, 0, Scale_row_bfp )

LABEL( mloop )
    LVX( scale3, Scalep, indx3 )
    VCFSX( fsum0, rsum0, 0 )
    VPERM( bsum0, bsum0, bsum1, byte_pack )
    VCFSX( fsum1, rsum1, 0 )
    VCFSX( fsum2, rsum2, 0 )
    ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
    VCFSX( fsum3, rsum3, 0 )
    ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )
    VMADDFP( fsum0, fsum0, bf_scale, zero )
    VPERM( bsum2, bsum2, bsum3, byte_pack )
    VMADDFP( fsum1, fsum1, bf_scale, zero )
    VMADDFP( fsum2, fsum2, bf_scale, zero )
    VMADDFP( fsum3, fsum3, bf_scale, zero )
    VMADDFP( fsum_scale0, fsum0, inv_scale, zero )
    VPERM( bvector, bsum0, bsum2, byte_merge )
    VMADDFP( fsum_scale1, fsum1, inv_scale, zero )
    ADDIC_C( Num_virt_users, Num_virt_users, -16 )
    VMADDFP( fsum_scale2, fsum2, inv_scale, zero )
    VMADDFP( fsum_scale3, fsum3, inv_scale, zero )
    ADDI( Scalep, Scalep, 64 )
    VMADDFP( fsum_scale0, fsum_scale0, scale0, zero )
    VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
    VMADDFP( fsum_scale1, fsum_scale1, scale1, zero )
    VMADDFP( fsum_scale2, fsum_scale2, scale2, zero )
    VMADDFP( fsum_scale3, fsum_scale3, scale3, zero )
    VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
    VSRB( vtmp, bvector, seven )
    VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
    VSRB( vtmp2, bscale_vector, seven )
    BLE( loop_flush )
    LVX( rsum0, 0, Rsump )
    VADDSBS( bvector, bvector, vtmp )
    LVX( rsum1, Rsump, indx1 )
    VADDSBS( bscale_vector, bscale_vector, vtmp2 )
    LVX( rsum2, Rsump, indx2 )
    VCTSXS( bsum0, fsum0, 24 )
    LVX( rsum3, Rsump, indx3 )
    VCTSXS( bsum1, fsum1, 24 )
    ADDI( Rsump, Rsump, 64 )
    VCTSXS( bsum2, fsum2, 24 )
    LVX( scale0, 0, Scalep )
    VCTSXS( bsum3, fsum3, 24 )
    LVX( scale1, Scalep, indx1 )
    VCTSXS( bsum_scale0, fsum_scale0, 24 )
    VCTSXS( bsum_scale1, fsum_scale1, 24 )
    LVX( scale2, Scalep, indx2 )
    VCTSXS( bsum_scale2, fsum_scale2, 24 )
/* } */
    BR( loop )



/**
  Flush loop
 **/
LABEL( loop_flush )
    VADDSBS( bvector, bvector, vtmp )
    STORE_SCALE( bvector, 0, No_scale_row_bfp )
    VADDSBS( bscale_vector, bscale_vector, vtmp2 )
    STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
    ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
    ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )

LABEL( sixteen_sums )
    VCTSXS( bsum0, fsum0, 24 )
    VCTSXS( bsum1, fsum1, 24 )
    VCTSXS( bsum2, fsum2, 24 )
    VCTSXS( bsum3, fsum3, 24 )
    VCTSXS( bsum_scale0, fsum_scale0, 24 )
    VPERM( bsum0, bsum0, bsum1, byte_pack )
    VCTSXS( bsum_scale1, fsum_scale1, 24 )
    VPERM( bsum2, bsum2, bsum3, byte_pack )
    VCTSXS( bsum_scale2, fsum_scale2, 24 )
    VPERM( bvector, bsum0, bsum2, byte_merge )
    VCTSXS( bsum_scale3, fsum_scale3, 24 )
    VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
    VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
    VSRB( vtmp, bvector, seven )
    VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
    VADDSBS( bvector, bvector, vtmp )
    VSRB( vtmp, bscale_vector, seven )
    STORE_SCALE( bvector, 0, No_scale_row_bfp )
    VADDSBS( bscale_vector, bscale_vector, vtmp )
    STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
/**
  Return
 **/
LABEL( ret )
    FREE_THRU_v31( VRSAVE_COND )
    RETURN
FUNC_EPILOG
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******************************************* 
--**************************************************************
--**
--** Majority Voter Control Logic
--**
--** Description: This Module serves as a generic majority voter
--**
--** Author: Steven Imperiali/Mirza Cifric
--**
--** Date: 5-15-2000
--**
--**************************************************************
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
USE STD.TEXTIO.ALL;

--**************************************************************

ENTITY m_voter IS
PORT(
    clk_66_pal6 : IN  std_logic;
    reset_0     : IN  std_logic;
    request0_0  : IN  std_logic;
    request1_0  : IN  std_logic;
    request2_0  : IN  std_logic;
    request3_0  : IN  std_logic;
    request4_0  : IN  std_logic;
    healthy0_1  : IN  std_logic;
    healthy1_1  : IN  std_logic;
    healthy2_1  : IN  std_logic;
    healthy3_1  : IN  std_logic;
    healthy4_1  : IN  std_logic;
    voteout_0   : OUT std_logic);
END m_voter;

ARCHITECTURE voter OF m_voter IS

signal pro: STD_LOGIC_VECTOR(3 downto 0);
signal against: STD_LOGIC_VECTOR(3 downto 0);
signal result: STD_LOGIC;

BEGIN

check_result: process (request0_0, request1_0, request2_0, request3_0, request4_0,
                       healthy0_1, healthy1_1, healthy2_1, healthy3_1, healthy4_1)
    variable pro: STD_LOGIC_VECTOR(3 downto 0);
    variable against: STD_LOGIC_VECTOR(3 downto 0);
    variable solution: STD_LOGIC;
begin
    pro := "0000";      -- set number of pro voters
    against := "0000";  -- set number of against voters

    -- Get the number of pros
    if (healthy0_1 = '1' and request0_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy1_1 = '1' and request1_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy2_1 = '1' and request2_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy3_1 = '1' and request3_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy4_1 = '1' and request4_0 = '1') then
        pro := pro + "0001";
    end if;

    -- Get the number of cons
    if (healthy0_1 = '1' and request0_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy1_1 = '1' and request1_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy2_1 = '1' and request2_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy3_1 = '1' and request3_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy4_1 = '1' and request4_0 = '0') then
        against := against + "0001";
    end if;

    -- final score
    if (pro = "0001" and against < "0001") then
        solution := '1';
    elsif (pro = "0010" and against < "0010") then
        solution := '1';
    elsif (pro = "0011" and against < "0011") then
        solution := '1';
    elsif (pro = "0100" and against < "0011") then
        solution := '1';
    elsif (pro = "0101" and against < "0011") then
        solution := '1';
    else solution := '0';
    end if;

    result <= solution;       -- put variable val into signal val
    -- voteout_0 <= solution; -- put variable val into signal val
end process check_result;

result_latch: process (reset_0, clk_66_pal6)
begin
    IF (reset_0 = '0') THEN
        voteout_0 <= '1';
    ELSIF rising_edge(clk_66_pal6) THEN
        IF result = '0' THEN
            voteout_0 <= '0';
        END IF;
    END IF;
END PROCESS;

END voter;
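
The final-score chain in the check_result process amounts to a strict majority among the healthy voters: a voter counts only if its healthy input is '1', and the result is '1' only when the pro count exceeds the against count. A small C model of that combinational logic follows, as a sketch for checking the thresholds by hand. It does not model the result_latch process or the reset behavior, and the function name majority_vote is hypothetical.

```c
#include <assert.h>

#define NUM_VOTERS 5

/* Hypothetical software model of the check_result process.
 * request[i] and healthy[i] hold 0 or 1. Returns 1 when the
 * healthy voters carry the request, 0 otherwise. */
int majority_vote(const int request[NUM_VOTERS],
                  const int healthy[NUM_VOTERS])
{
    int i, pro = 0, against = 0;

    for (i = 0; i < NUM_VOTERS; i++) {
        assert(request[i] == 0 || request[i] == 1);
        assert(healthy[i] == 0 || healthy[i] == 1);
        if (healthy[i]) {
            if (request[i])
                ++pro;
            else
                ++against;
        }
    }

    /* Same thresholds as the VHDL if/elsif chain. */
    if (pro == 1)
        return against < 1;
    if (pro == 2)
        return against < 2;
    if (pro >= 3)
        return against < 3;
    return 0;
}
```

With five voters the "against < 3" tests for pro counts of 3, 4 and 5 always pass, since at most two voters remain to vote against; the thresholds only bite when unhealthy voters shrink the electorate to one or two pro votes.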



m_voter.vhd 3/9/2001

mudlib.h 2/23/2001

/**
 * FILENAME: mudlib.h
 * CC NUMBER:
 * ABSTRACT:
 * USAGE:
 *
 * COMMENTS:
 * AUTHOR: M. Vinskus
 *
 * DATE: 18-JUL-2000
 **/

/* @MERCURY.COPYRIGHT.H@ */

#ifndef MUDLIB_H
#define MUDLIB_H


/****************************************************************************
 *
 * INCLUDE FILES
 *
 ****************************************************************************/

#include <sal.h>

/****************************************************************************
 *
 * DEFINED CONSTANTS
 *
 ****************************************************************************/

#define NUM_FINGERS_LOG          2
#define NUM_FINGERS_SQUARED_LOG  (2 * NUM_FINGERS_LOG)
#define NUM_FINGERS              (1 << NUM_FINGERS_LOG)
#define NUM_FINGERS_SQUARED      (1 << NUM_FINGERS_SQUARED_LOG)

#define L1_CACHE_SIZE            32768
#define L1_CACHE_LINE_SIZE       32
#define L1_CACHE_ALIGN_LOG       5
#define L1_CACHE_ALIGN           (1 << L1_CACHE_ALIGN_LOG)
#define L1_CACHE_ALIGN_MASK      (L1_CACHE_ALIGN - 1)

#define R_MATRIX_ALIGN_LOG       5
#define R_MATRIX_ALIGN           (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK      (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG        4
#define ALTIVEC_ALIGN            (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK       (ALTIVEC_ALIGN - 1)

#define BF_CORR_FRAC_BITS        8
#define BF_CORR_FACTOR           ((float)(1 << BF_CORR_FRAC_BITS))

#define BF_MPATH_FRAC_BITS       15   /* this should be dynamic */
#define BF_MPATH_FACTOR          ((float)(1 << BF_MPATH_FRAC_BITS))

#define BF_RSUMS_FRAC_BITS       ((2 * BF_MPATH_FRAC_BITS) - 16 + \
                                  BF_CORR_FRAC_BITS)
#define BF_RSUMS_FACTOR          ((float)(1 << BF_RSUMS_FRAC_BITS))
#define BF_RSUMS_RFACTOR         (1.0 / BF_RSUMS_FACTOR)

#define BF_RY_FRAC_BITS          9    /* 0 <= BF_RY_FRAC_BITS <= 14 */
#define BF_RY_FACTOR             ((float)(1 << BF_RY_FRAC_BITS))
#define BF_RY_RFACTOR            (1.0 / BF_RY_FACTOR)

#define BF_COMBINED_FACTOR       ((float)(1 << \
                                  (BF_RSUMS_FRAC_BITS-BF_RY_FRAC_BITS)))
#define BF_COMBINED_RFACTOR      (1.0 / BF_COMBINED_FACTOR)

#define BF8_ZERO                 0
#define BF8_MAX                  0x7f
#define BF8_RY_ONE               ((BF8)(1 << BF_RY_FRAC_BITS))
#define BF16_RY_ONE              ((BF16)(1 << BF_RY_FRAC_BITS))
#define BF16_RY_MONE             (-BF16_RY_ONE)
#define BF16_ZERO                0
#define BF16_MAX                 0x7fff
#define BF32_ZERO                0
#define BF32_RY_ONE              ((BF32)(1 << BF_RY_FRAC_BITS))
#define BF32_MAX                 0x7fffffff

#define BIAS_8BIT                1

#define BFABS( x )               (((x) >= 0) ? (x) : (-(x)))
#define FABS( f )                (((f) >= 0.0) ? (f) : (-(f)))

/************************** + ********************************************* **** 
*** 

* TYPE DEFINITIONS 

**************************************************************************** 
**/ 

typedef long BF32; 
typedef short BF16; 
typedef char BF8; 

typedef struct { 

BF8 real ; 

BF8 imag; 
} COMPLEX_BF8; 

typedef struct { 

BF16 real; 

BF16 imag; 
} COMPLEX_BF16; 

typedef struct { 

BF32 real; 

BF32 imag; 
} COMPLEX_BF32; 

/****************************************************************************
 *
 * MACRO DEFINITIONS
 *
 ****************************************************************************/

/* assumes (-(2.0 ^ 7) - 0.5) < (bf_factor * s) < ((2.0 ^ 7) - 0.5) */

#define SFtoBF8( bf_factor, s ) \
    ((BF8)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF8( bf_factor, v, bfv, n ) \
{ \
    int i; \
    float factor = bf_factor; \
    vsmulx( v, 1, &factor, v, 1, n, 0 ); \
    for ( i = 0; i < n; i++ ) \
        bfv[i] = (v[i] > 0.0) ? (BF8)(v[i] + 0.5) : (BF8)(v[i] - 0.5); \
}

#define SBF8toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF8toF( bf_rfactor, bfv, v, n ) \
{ \
    int i; \
    float rfactor = bf_rfactor; \
    for ( i = 0; i < n; i++ ) \
        v[i] = (float)bfv[i]; \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}
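
The scalar macros above convert between floating point and block-floating-point bytes by scaling with a power-of-two factor and rounding half away from zero. The sketch below restates SFtoBF8 and SBF8toF as functions so the range assumption in the comment can be checked with an assertion. The names float_to_bf8 and bf8_to_float are hypothetical, and BF8 is taken as signed char here, whereas the header uses plain char, whose signedness depends on the compiler.

```c
#include <assert.h>

typedef signed char BF8;    /* assumption: mudlib.h uses plain char */

/* Hypothetical function form of SFtoBF8: scale, then round half away
 * from zero. The product must lie strictly inside (-128.5, 127.5),
 * per the "assumes" comment in the header. */
BF8 float_to_bf8(float bf_factor, float s)
{
    float scaled = bf_factor * s;

    assert(scaled > -128.5f && scaled < 127.5f);
    return (BF8)(scaled + ((s > 0.0f) ? 0.5f : -0.5f));
}

/* Hypothetical function form of SBF8toF: multiply by the reciprocal
 * factor to recover an approximate floating-point value. */
float bf8_to_float(float bf_rfactor, BF8 bfs)
{
    return bf_rfactor * (float)bfs;
}
```

With BF_RY_FRAC_BITS equal to 9 the factor is 512, so representable inputs lie roughly within plus or minus 0.25 and the quantization step is 1/512.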

/* assumes (-(2.0 ^ 15) - 0.5) < (bf_factor * s) < ((2.0 ^ 15) - 0.5) */

#define SFtoBF16( bf_factor, s ) \
    ((BF16)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF16( bf_factor, v, bfv, n ) \
{ \
    float factor = bf_factor; \
    vsmulx( (float *)v, 1, &factor, (float *)v, 1, n, 0 ); \
    vfixrx( (float *)v, 1, (BF16 *)bfv, 1, n, 0 ); \
}

#define SBF16toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF16toF( bf_rfactor, bfv, v, n ) \
{ \
    float rfactor = bf_rfactor; \
    vfltx( (short *)bfv, 1, v, 1, n, 0 ); \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}

/* assumes (-(2.0 ^ 31) - 0.5) < (bf_factor * s) < ((2.0 ^ 31) - 0.5) */

#define SFtoBF32( bf_factor, s ) \
    ((BF32)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF32( bf_factor, v, bfv, n ) \
{ \
    float factor = bf_factor; \
    vsmulx( v, 1, &factor, (float *)bfv, 1, n, 0 ); \
    vfixr32x( (float *)bfv, 1, (int *)bfv, 1, n, 0 ); \
}

#define SBF32toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF32toF( bf_rfactor, bfv, v, n ) \
{ \
    float rfactor = bf_rfactor; \
    vflt32x( (int *)bfv, 1, v, 1, n, 0 ); \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}

#define CORR_SFtoBF( s )            SFtoBF8( BF_CORR_FACTOR, s )

#define MPATH_VFtoBF( v, bfv, n )   VFtoBF16( BF_MPATH_FACTOR, v, bfv, ((n)<<1) )

#define BHAT_SFtoBF( s )            ((BF8)((s) + BIAS_8BIT))
#define BHAT_SBFtoF( bfs )          ((float)(bfs) - (float)BIAS_8BIT)

#define BHAT_VFtoBF( v, bfv, n ) \
{ \
    float bias = (float)BIAS_8BIT; \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
    fixpixax( v, 1, bfv, n, 0 ); \
}

#define BHAT_VBFtoF( bfv, v, n ) \
{ \
    float bias = (float)(-BIAS_8BIT); \
    fltpixax( bfv, v, 1, n, 0 ); \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
}

#define RHAT_SFtoBF( s )            SFtoBF8( BF_RY_FACTOR, s )
#define RHAT_SBFtoF( bfs )          SBF8toF( BF_RY_RFACTOR, bfs )
#define RHAT_VFtoBF( v, bfv, n )    VFtoBF8( BF_RY_FACTOR, v, bfv, n )
#define RHAT_VBFtoF( bfv, v, n )    VBF8toF( BF_RY_RFACTOR, bfv, v, n )

#define Y_SFtoBF( s )               SFtoBF32( BF_RY_FACTOR, s )
#define Y_SBFtoF( bfs )             SBF32toF( BF_RY_RFACTOR, bfs )
#define Y_VFtoBF( v, bfv, n )       VFtoBF32( BF_RY_FACTOR, v, bfv, n )
#define Y_VBFtoF( bfv, v, n )       VBF32toF( BF_RY_RFACTOR, bfv, v, n )

#define MUDLIB_DECR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    --virt_user; \
    if ( virt_user < 0 ) { \
        --phys_user; \
        virt_user = ptov_map[phys_user] - 1; \
    } \
}

#define MUDLIB_INCR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    ++virt_user; \
    if ( virt_user == ptov_map[phys_user] ) { \
        ++phys_user; \
        virt_user = 0; \
    } \
}

**/ 



int mudlib_get_Corr0_offset(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr0_size(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
);



int mudlib_get_Corr1_offset(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr1_size(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int num_fingers,            /* typically, 4 */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
);






int mudlib_get_R0_offset(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_size(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_R1_offset(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user         /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R1_size(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int tot_virt_users,         /* sum of ptov_map over all phys users */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
);



int mudlib_get_num_virt_users(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int end_phys_user,          /* zero-based index into ptov_map */
    int end_virt_user           /* must be < ptov_map[end_phys_user] */
);

void mudlib_get_end_user_pair(
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    int start_phys_user,        /* zero-based index into ptov_map */
    int start_virt_user,        /* must be < ptov_map[start_phys_user] */
    int num_virt_users,         /* number from start (must be > 0) */
    int *end_phys_user,         /* zero-based index into ptov_map */
    int *end_virt_user          /* will be < ptov_map[*end_phys_user] */
);



void mudlib_gen_R(
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF8 *corr_0_bf,     /* adjusted for starting physical user */
    COMPLEX_BF8 *corr_1_bf,     /* adjusted for starting physical user */
    unsigned char *ptov_map,    /* no more than 256 virts. per phys */
    float *bf_scalep,           /* scalar: always a power of 2 */
    float *inv_scalep,          /* adjusted for starting physical user */
    float *scalep,              /* start at 0'th physical user */
    char *L1_cachep,            /* must be 32-byte aligned */
    BF8 *R0_upper_bf,
    BF8 *R0_lower_bf,
    BF8 *R1_trans_bf,
    BF8 *R1m_bf,
    int tot_phys_users,
    int tot_virt_users,
    int start_phys_user,        /* zero-based ("starting row") */
    int start_virt_user,        /* relative to start_phys_user */
    int end_phys_user,          /* actual number of "rows" to process */
    int end_virt_user           /* relative to end_phys_user */
);



void mudlib_4R_to_3R(
    BF8  *R0_upper_bf,               /* input matrix */
    BF8  *R0_lower_bf,               /* input matrix */
    BF8  *R1_trans_bf,               /* input matrix */
    char *L1_cachep,                 /* 32K-byte temp, 32-byte aligned */
    BF8  *R0_bf,                     /* output matrix */
    BF8  *R1_bf,                     /* output matrix */
    int   tot_virt_users
);

void mudlib_mpic( BF8  *Bt_hat,
                  BF8  *R0_hat,
                  BF8  *R1_hat,
                  BF8  *R1m_hat,
                  BF32 *Y,
                  BF32  Ythresh,
                  int   N_users,
                  int   N_bits,
                  int   N_stages );

void mudlib_reformat_corr( COMPLEX     *in_corr,
                           COMPLEX_BF8 *corr_0_bf,
                           COMPLEX_BF8 *corr_1_bf,
                           int          num_virt_users,
                           int          num_multipath );

void fixed_zidotprx( COMPLEX_SPLIT *A, int I, COMPLEX_SPLIT *B, int J,
                     COMPLEX_SPLIT *C, int N, int X );

/*
 * temp names (_v)
 */

int mudlib_get_Corr0_offset_v(
    unsigned char *ptov_map,         /* no more than 256 virts. per phys */
    int            num_fingers,      /* typically, 4 */
    int            tot_virt_users,   /* sum of ptov_map over all phys users */
    int            start_phys_user,  /* zero-based index into ptov_map */
    int            start_virt_user   /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr1_offset_v(
    unsigned char *ptov_map,         /* no more than 256 virts. per phys */
    int            num_fingers,      /* typically, 4 */
    int            tot_virt_users,   /* sum of ptov_map over all phys users */
    int            start_phys_user,  /* zero-based index into ptov_map */
    int            start_virt_user   /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_offset_v(
    unsigned char *ptov_map,         /* no more than 256 virts. per phys */
    int            tot_virt_users,   /* sum of ptov_map over all phys users */
    int            start_phys_user,  /* zero-based index into ptov_map */
    int            start_virt_user   /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_size_v(
    unsigned char *ptov_map,         /* no more than 256 virts. per phys */
    int            tot_virt_users,   /* sum of ptov_map over all phys users */
    int            start_phys_user,  /* zero-based index into ptov_map */
    int            start_virt_user,  /* must be < ptov_map[start_phys_user] */
    int            end_phys_user,    /* zero-based index into ptov_map */
    int            end_virt_user     /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_R1_offset_v(
    unsigned char *ptov_map,         /* no more than 256 virts. per phys */
    int            tot_virt_users,   /* sum of ptov_map over all phys users */
    int            start_phys_user,  /* zero-based index into ptov_map */
    int            start_virt_user   /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R1_size_v(
    unsigned char *ptov_map,         /* no more than 256 virts. per phys */
    int            tot_virt_users,   /* sum of ptov_map over all phys users */
    int            start_phys_user,  /* zero-based index into ptov_map */
    int            start_virt_user,  /* must be < ptov_map[start_phys_user] */
    int            end_phys_user,    /* zero-based index into ptov_map */
    int            end_virt_user     /* must be < ptov_map[end_phys_user] */
);

#endif /* _MUDLIB_H */
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^include "mudlib.h" 



#define INDEX_5D_TO_LIN(a0, a1, a2, a3, a4, max_a1, max_a2, max_a3, max_a4) \
    ((a4) + (max_a4) * ((a3) + (max_a3) * ((a2) + (max_a2) * ((a1) \
    + (max_a1) * (a0)))))
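
The macro above flattens a five-dimensional index in row-major order, with the last index (a4) varying fastest. The sketch below restates the macro and wraps it in a hypothetical helper, corr_index, that uses the same dimension ordering as the calls in mudlib_reformat_corr (users, users, fingers, fingers); the helper name and the small dimensions in the usage note are illustrative assumptions only.

```c
#include <assert.h>

#define INDEX_5D_TO_LIN(a0, a1, a2, a3, a4, max_a1, max_a2, max_a3, max_a4) \
    ((a4) + (max_a4) * ((a3) + (max_a3) * ((a2) + (max_a2) * ((a1) \
    + (max_a1) * (a0)))))

/* Hypothetical wrapper: linear offset of element [lag][i][j][q1][q] in an
 * array dimensioned [..][num_virt_users][num_virt_users][num_fingers][num_fingers]. */
static int corr_index(int lag, int i, int j, int q1, int q,
                      int num_virt_users, int num_fingers)
{
    return INDEX_5D_TO_LIN(lag, i, j, q1, q,
                           num_virt_users, num_virt_users,
                           num_fingers, num_fingers);
}
```

With 3 virtual users and 2 fingers, stepping q by one moves one element, stepping i by one moves 3 * 2 * 2 = 12 elements, and stepping the leading index from 0 to 1 moves 3 * 3 * 2 * 2 = 36 elements, which is where the second block of correlations begins.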

void mudlib_reformat_corr(
    COMPLEX     *in_corr,
    COMPLEX_BF8 *corr_0_bf,
    COMPLEX_BF8 *corr_1_bf,
    int          num_virt_users,
    int          num_fingers )
{
    int i, j, q, q1;

    /* lag 0: only the strictly upper triangle (j > i) is stored */
    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = (i+1); j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_0_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          0, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].real );
                    corr_0_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          0, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].imag );
                    ++corr_0_bf;
                }
            }
        }
    }

    /* lag 1: the full matrix is stored */
    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = 0; j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_1_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          1, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].real );
                    corr_1_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          1, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].imag );
                    ++corr_1_bf;
                }
            }
        }
    }
}
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^include "mudlib.h" 

void mtrans32_8bit(
    BF8  *A,            /* logically contiguous input 32 x 32 blocks */
    BF8  *C,            /* output blocks separated by 32 * out_tc elements */
    char *L1_cachep,
    int   A_ncols,
    int   A_nrows,
    int   C_tcols
);

void mtriangle_8bit(
    BF8 *A,
    BF8 *C,
    int  N
);

void mudlib_4R_to_3R(
    BF8  *R0_upper_bf,  /* input matrix */
    BF8  *R0_lower_bf,  /* input matrix */
    BF8  *R1_trans_bf,  /* input matrix */
    char *L1_cachep,    /* temp: 32K bytes, 32-byte aligned */
    BF8  *R0_bf,        /* output matrix */
    BF8  *R1_bf,        /* output matrix */
    int   tot_virt_users
)
{
    BF8 *R0_work;
    int  i, nrows, R0_tcols, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    nrows = R_MATRIX_ALIGN;
    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;
        mtrans32_8bit( R1_trans_bf, R1_bf, L1_cachep, tot_virt_users,
                       nrows, tcols );
        R1_trans_bf += (tcols << R_MATRIX_ALIGN_LOG);
        R1_bf += R_MATRIX_ALIGN;
    }

    R0_work = R0_bf;
    R0_tcols = tcols;
    nrows = R_MATRIX_ALIGN;
    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;
        mtrans32_8bit( R0_lower_bf, R0_work, L1_cachep, i, nrows, tcols );
        R0_lower_bf += (R0_tcols << R_MATRIX_ALIGN_LOG);
        R0_work += ((tcols << R_MATRIX_ALIGN_LOG) + R_MATRIX_ALIGN);
        R0_tcols -= R_MATRIX_ALIGN;
    }

    mtriangle_8bit( R0_upper_bf, R0_bf, tot_virt_users );
}



#if COMPILE_C 

void mtrans32_8bit(
    BF8  *A,            /* logically contiguous input A_nrows x A_ncols blocks */
    BF8  *C,            /* output blocks separated by 32 * C_tcols elements */
    char *L1_cachep,
    int   A_ncols,
    int   A_nrows,
    int   C_tcols
)
{
    BF8 *Ap, *Cp;
    int  A_tcols, C_nrows;
    int  i, j;

    (void)L1_cachep;

    A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_nrows = R_MATRIX_ALIGN;

    while ( A_ncols ) {
        if ( A_ncols < C_nrows ) C_nrows = A_ncols;
        Ap = A;
        Cp = C;
        for ( i = 0; i < A_nrows; i++ ) {
            for ( j = 0; j < C_nrows; j++ )
                Cp[j * C_tcols] = Ap[j];
            Ap += A_tcols;
            Cp += 1;
        }
        A += R_MATRIX_ALIGN;                    /* input travels horizontally */
        C += (C_tcols << R_MATRIX_ALIGN_LOG);   /* output travels vertically */
        A_ncols -= C_nrows;
    }
}

void mtriangle_8bit(
    BF8 *A,
    BF8 *C,
    int  N
)
{
    int A_counter, A_tcols, altivec_N, C_tcols;
    int i, j;

    A_counter = (N + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_tcols = A_counter + 1;
    altivec_N = (N + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

    for ( i = 0; i < N; i++ ) {
        for ( j = 0; j < altivec_N; j++ )
            C[j] = A[j];
        --altivec_N;
        A_tcols = (--A_counter + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
        A += (A_tcols + 1);
        C += C_tcols;
    }
}

#endif /* COMPILE_C */
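
The C fallback of mtrans32_8bit walks the input in strips of up to 32 columns and writes each strip transposed, so that input element (row r, column c) lands at output offset c * C_tcols + r. The following stand-alone sketch restates that loop with plain unsigned char data so the mapping can be checked in isolation; the names strip_transpose and round_up_to_align are hypothetical, and the alignment constant of 32 is taken from R_MATRIX_ALIGN_LOG = 5 in the listing.

```c
#include <assert.h>

#define ALIGN_LOG  5
#define ALIGN      (1 << ALIGN_LOG)
#define ALIGN_MASK (ALIGN - 1)

/* Round n up to the next multiple of ALIGN (a power of two). */
static int round_up_to_align(int n)
{
    return (n + ALIGN_MASK) & ~ALIGN_MASK;
}

/* Hypothetical restatement of the strip-wise transpose loop.
 * A has A_nrows rows of A_ncols valid bytes, with a row stride of
 * round_up_to_align(A_ncols). C receives the transpose with row
 * stride C_tcols, which must be at least A_nrows. */
static void strip_transpose(const unsigned char *A, unsigned char *C,
                            int A_ncols, int A_nrows, int C_tcols)
{
    int A_tcols = round_up_to_align(A_ncols);
    int C_nrows = ALIGN;

    while (A_ncols) {
        if (A_ncols < C_nrows) C_nrows = A_ncols;   /* short last strip */
        const unsigned char *Ap = A;
        unsigned char *Cp = C;
        for (int i = 0; i < A_nrows; i++) {
            for (int j = 0; j < C_nrows; j++)
                Cp[j * C_tcols] = Ap[j];
            Ap += A_tcols;
            Cp += 1;
        }
        A += ALIGN;                      /* next strip of input columns */
        C += (C_tcols << ALIGN_LOG);     /* next 32 output rows */
        A_ncols -= C_nrows;
    }
}
```

A 3 x 40 input, for example, is processed as one 32-column strip followed by one 8-column strip, and the input row stride is 64 because 40 rounds up to the next multiple of 32.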
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MC Standard Algorithms -- PPC Macro Language Version 



File Name:    mtrans32_8bit.mac

Description:  Perform N_tiles 32 x 32 byte transposes

void mtrans32_8bit(
    BF8  *A,            /* contiguous input 32 x 32 blocks */
    BF8  *C,            /* output blocks separated by 32 * out_tc elements */
    char *L1_cache,
    int   A_ncols,
    int   A_nrows,
    int   C_tcols
)
{
    BF8 *Ap, *Cp;
    int  A_tcols, C_nrows;
    int  i, j;

    A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) &
              ~R_MATRIX_ALIGN_MASK;
    C_nrows = R_MATRIX_ALIGN;

    while ( A_ncols ) {
        if ( A_ncols < C_nrows ) C_nrows = A_ncols;
        Ap = A;
        Cp = C;
        for ( i = 0; i < A_nrows; i++ ) {
            for ( j = 0; j < C_nrows; j++ )
                Cp[j * C_tcols] = Ap[j];
            Ap += A_tcols;
            Cp += 1;
        }
        A += R_MATRIX_ALIGN;
        C += (C_tcols << R_MATRIX_ALIGN_LOG);
        A_ncols -= C_nrows;
    }
}

Restrictions: A, C and L1_cache must all be 16-byte aligned.
              C_tcols must be a multiple of 16.



Mercury Computer Systems, Inc. 
Copyright (c) 2000 All rights reserved 

Revision Date Engineer Reason 
0.0 000913 fpl Created 



#include "salppc.inc"

#define DO_PREFETCH 1

#define LOAD_INPUT( vT, rA, rB )     LVXL( vT, rA, rB )
#define LOAD_CACHE( vT, rA, rB )     LVX( vT, rA, rB )

#define STORE_CACHE( vS, rA, rB )    STVX( vS, rA, rB )
#define STORE_OUTPUT( vS, rA, rB )   STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG    5
#define R_MATRIX_ALIGN        (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK   (R_MATRIX_ALIGN - 1)






EV 093 931 868 US 

#define ALTIVEC_ALIGN_LOG     4
#define ALTIVEC_ALIGN         (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK    (ALTIVEC_ALIGN - 1)

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM, DST_BUMP ) \
    DSTT( rA, rB, STRM ) \
    ADD( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM, DST_BUMP )
#endif

/**
    Four permute vectors for output stage
**/

RODATA_SECTION( 5 )

START_L_ARRAY( local_table )
L_PERMUTE_MASK( 0x00010405, 0x08090c0d, 0x10111415, 0x18191c1d )
L_PERMUTE_MASK( 0x02030607, 0x0a0b0e0f, 0x12131617, 0x1a1b1e1f )
L_PERMUTE_MASK( 0x00020406, 0x080a0c0e, 0x10121416, 0x181a1c1e )
L_PERMUTE_MASK( 0x01030507, 0x090b0d0f, 0x11131517, 0x191b1d1f )
END_ARRAY
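
The four permute vectors in the table above are byte-selection masks for the AltiVec vperm instruction, which picks each result byte from the 32-byte concatenation of its two source vectors. As a rough illustration only, the sketch below emulates that selection in portable C; vperm_emul and the VP0..VP3 arrays are hypothetical names, and the arrays simply spell out the table words byte by byte in big-endian order. Read this way, the first two masks gather the even and odd halfwords of the source pair, and the last two gather the even and odd bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates vperm: out[i] = (a || b)[mask[i] & 0x1f].
 * out must not alias a or b. */
static void vperm_emul(uint8_t out[16], const uint8_t a[16],
                       const uint8_t b[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t sel = mask[i] & 0x1f;
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* The four table entries, written out one byte at a time. */
static const uint8_t VP0[16] = { 0x00,0x01,0x04,0x05, 0x08,0x09,0x0c,0x0d,
                                 0x10,0x11,0x14,0x15, 0x18,0x19,0x1c,0x1d };
static const uint8_t VP1[16] = { 0x02,0x03,0x06,0x07, 0x0a,0x0b,0x0e,0x0f,
                                 0x12,0x13,0x16,0x17, 0x1a,0x1b,0x1e,0x1f };
static const uint8_t VP2[16] = { 0x00,0x02,0x04,0x06, 0x08,0x0a,0x0c,0x0e,
                                 0x10,0x12,0x14,0x16, 0x18,0x1a,0x1c,0x1e };
static const uint8_t VP3[16] = { 0x01,0x03,0x05,0x07, 0x09,0x0b,0x0d,0x0f,
                                 0x11,0x13,0x15,0x17, 0x19,0x1b,0x1d,0x1f };
```

Applying VP0 and VP1 to a pair of vectors and then VP2 and VP3 to the results separates the data first at halfword and then at byte granularity, which is how the output stage completes the byte-level transpose that the word merges of the input stage began.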
/**
    Input parameters
**/

#define A           r3
#define C           r4
#define L1_cache    r5
#define NC          r6
#define NR          r7
#define TCC         r8

#define NC_left     NC
#define TCA         r9
#define TCA4        r10
#define icount      r11

#define aptr0       r12
#define aptr1       r13
#define aptr2       r14
#define aptr3       r15
#define aindx0      r16
#define aindx1      r17
#define aindx2      r18
#define aindx3      r19
#define cptr0       r20
#define cptr1       r21
#define cptr2       r22
#define cptr3       r23
#define cindx0      r24
#define cindx1      r25
#define cindx2      r26
#define cindx3      r27
#define cindx4      aindx0
#define cindx5      aindx1
#define cindx6      aindx2
#define cindx7      aindx3
#define out_indx0   aptr0
#define out_indx1   aptr1
#define out_indx2   aptr2
#define out_indx3   aptr3
#define cptr        cptr0
#define outptr0     cptr1
#define outptr1     cptr2
#define TCC4        cptr3

#define tptr        icount
#define temp        aptr3

#define Cbump       r0
#define dstp        r0
#define dst_code    r28

/**
    G4 registers
**/

#define a00     v0
#define a01     v1
#define a02     v2
#define a03     v3
#define a10     v4
#define a11     v5
#define a12     v6
#define a13     v7
#define a20     v8
#define a21     v9
#define a22     v10
#define a23     v11
#define a30     v12
#define a31     v13
#define a32     v14
#define a33     v15

#define c00     v16
#define c01     v17
#define c02     v18
#define c03     v19
#define c10     v20
#define c11     v21
#define c12     v22
#define c13     v23
#define c20     c00
#define c21     c01
#define c22     c02
#define c23     c03
#define c30     c10
#define c31     c11
#define c32     c12
#define c33     c13

#define vt0     v24
#define vt1     v25
#define vt2     v26
#define vt3     v27
#define vt4     c00
#define vt5     c01
#define vt6     c02
#define vt7     c03

#define vp0     v28
#define vp1     v29
#define vp2     v30
#define vp3     v31

#define c0      a00
#define c1      a01
#define c2      a02
#define c3      a03
#define c4      a10
#define c5      a11
#define c6      a12
#define c7      a13

#define out0    a20
#define out1    a21
#define out2    a22
#define out3    a23
#define out4    a30
#define out5    a31
#define out6    a32
#define out7    a33

/** 
Text begins 

**/ 

FUNC_PROLOG

ENTRY_5( mtrans32_8bit, A, C, L1_cache, N, TCC )
SAVE_r13_r28

USE_THRU_v31( VRSAVE_COND )

    ADDI( TCA, NC, R_MATRIX_ALIGN_MASK )
    CMPWI( NC_left, 32 )
    RLWINM( TCA, TCA, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
    LA( tptr, local_table, 0 )
    MAKE_STREAM_CODE_IIR( dst_code, 64, 4, TCA )

    LVX( vp0, 0, tptr )
    ADDI( tptr, tptr, 16 )
    LVX( vp1, 0, tptr )
    ADDI( tptr, tptr, 16 )
    XORI( temp, A, 32 )
    LVX( vp2, 0, tptr )
    ADDI( tptr, tptr, 16 )
    SLWI( TCA4, TCA, 2 )
    LVX( vp3, 0, tptr )
    BLE( cont )

    ANDI_C( temp, temp, 32 )
    BR( cont )



/**
    Outer loop transposes 2 (or 1 at end) 32 x 32 tiles per trip
**/

LABEL( outer_loop )
/* { */
    CMPWI( NC_left, 32 )
LABEL( cont )
    ADD( dstp, A, TCA4 )                /* start prefetch advanced */
    MR( aptr0, A )
    ADD( dstp, dstp, TCA )              /* advanced further */
    LI( aindx0, 0 )
    ADD( aptr1, aptr0, TCA )
    LI( aindx1, 16 )
    ADD( aptr2, aptr1, TCA )
    MR( cptr0, L1_cache )
    ADD( aptr3, aptr2, TCA )
    ADDI( cptr1, cptr0, 512 )
    LOAD_INPUT( a00, aptr0, aindx0 )    /*** begins next sequence ***/
    LI( cindx1, 128 )
    LOAD_INPUT( a10, aptr1, aindx0 )
    LI( cindx2, 256 )
    LOAD_INPUT( a20, aptr2, aindx0 )
    LI( cindx3, 384 )
    LOAD_INPUT( a30, aptr3, aindx0 )
    MR( icount, NR )
    BLE( input_loop_do1 )

    LI( aindx2, 32 )                    /* these are used only in two tile loop */
    LOAD_INPUT( a02, aptr0, aindx2 )
    LI( aindx3, 48 )
    LOAD_INPUT( a12, aptr1, aindx2 )
    ADDI( cptr2, cptr1, 512 )
    LOAD_INPUT( a22, aptr2, aindx2 )
    ADDI( cptr3, cptr2, 512 )
    LOAD_INPUT( a32, aptr3, aindx2 )

/**
    Top of input loop processes a 4 x 64 byte tile each trip
**/

LABEL( input_loop_do2 )
/* { */
    PREFETCH( dstp, dst_code, 0, TCA4 )

    VMRGHW(vt0, a00, a20)   /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
    LOAD_INPUT( a01, aptr0, aindx1 )
    VMRGLW(vt2, a00, a20)   /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
    LOAD_INPUT( a11, aptr1, aindx1 )
    VMRGHW(vt1, a10, a30)   /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
    LOAD_INPUT( a21, aptr2, aindx1 )
    VMRGLW(vt3, a10, a30)   /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
    LOAD_INPUT( a31, aptr3, aindx1 )

    VMRGHW(c00, vt0, vt1)   /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
    STORE_CACHE( c00, cptr0, cindx0 )
    VMRGLW(c01, vt0, vt1)   /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
    STORE_CACHE( c01, cptr0, cindx1 )
    VMRGHW(c02, vt2, vt3)   /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
    STORE_CACHE( c02, cptr0, cindx2 )
    VMRGLW(c03, vt2, vt3)   /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
    STORE_CACHE( c03, cptr0, cindx3 )

    VMRGHW(vt0, a01, a21)   /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
    LOAD_INPUT( a03, aptr0, aindx3 )
    VMRGLW(vt2, a01, a21)   /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
    LOAD_INPUT( a13, aptr1, aindx3 )
    VMRGHW(vt1, a11, a31)   /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
    LOAD_INPUT( a23, aptr2, aindx3 )
    VMRGLW(vt3, a11, a31)   /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
    LOAD_INPUT( a33, aptr3, aindx3 )

    VMRGHW(c10, vt0, vt1)   /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
    STORE_CACHE( c10, cptr1, cindx0 )
    VMRGLW(c11, vt0, vt1)   /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
    STORE_CACHE( c11, cptr1, cindx1 )
    VMRGHW(c12, vt2, vt3)   /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
    STORE_CACHE( c12, cptr1, cindx2 )
    VMRGLW(c13, vt2, vt3)   /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
    STORE_CACHE( c13, cptr1, cindx3 )

    BLE( flush_input_loop_do2 )

    ADD( aindx0, aindx0, TCA4 )         /* bump for next load sequence */
    ADD( aindx1, aindx1, TCA4 )
    ADD( aindx2, aindx2, TCA4 )
    ADD( aindx3, aindx3, TCA4 )

    VMRGHW(vt0, a02, a22)   /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
    LOAD_INPUT( a00, aptr0, aindx0 )    /*** begins next sequence ***/
    VMRGLW(vt2, a02, a22)   /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
    LOAD_INPUT( a02, aptr0, aindx2 )
    VMRGHW(vt1, a12, a32)   /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
    LOAD_INPUT( a10, aptr1, aindx0 )
    VMRGLW(vt3, a12, a32)   /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */
    LOAD_INPUT( a12, aptr1, aindx2 )

    VMRGHW(c20, vt0, vt1)   /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
    STORE_CACHE( c20, cptr2, cindx0 )
    VMRGLW(c21, vt0, vt1)   /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
    STORE_CACHE( c21, cptr2, cindx1 )
    VMRGHW(c22, vt2, vt3)   /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
    STORE_CACHE( c22, cptr2, cindx2 )
    VMRGLW(c23, vt2, vt3)   /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
    STORE_CACHE( c23, cptr2, cindx3 )

    VMRGHW(vt0, a03, a23)   /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
    LOAD_INPUT( a20, aptr2, aindx0 )
    VMRGLW(vt2, a03, a23)   /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
    LOAD_INPUT( a22, aptr2, aindx2 )
    VMRGHW(vt1, a13, a33)   /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
    LOAD_INPUT( a30, aptr3, aindx0 )
    VMRGLW(vt3, a13, a33)   /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */
    LOAD_INPUT( a32, aptr3, aindx2 )

    VMRGHW(c30, vt0, vt1)   /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
    STORE_CACHE( c30, cptr3, cindx0 )
    VMRGLW(c31, vt0, vt1)   /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
    STORE_CACHE( c31, cptr3, cindx1 )
    VMRGHW(c32, vt2, vt3)   /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
    STORE_CACHE( c32, cptr3, cindx2 )
    VMRGLW(c33, vt2, vt3)   /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
    STORE_CACHE( c33, cptr3, cindx3 )

    ADDI( cindx0, cindx0, 16 )          /* bump for next store sequence */
    ADDI( cindx1, cindx1, 16 )
    ADDI( cindx2, cindx2, 16 )
    ADDI( cindx3, cindx3, 16 )

    BR( input_loop_do2 )

LABEL( flush_input_loop_do2 )

    VMRGHW(vt0, a02, a22)   /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
    VMRGLW(vt2, a02, a22)   /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
    VMRGHW(vt1, a12, a32)   /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
    VMRGLW(vt3, a12, a32)   /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */

    VMRGHW(c20, vt0, vt1)   /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
    STORE_CACHE( c20, cptr2, cindx0 )
    VMRGLW(c21, vt0, vt1)   /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
    STORE_CACHE( c21, cptr2, cindx1 )
    VMRGHW(c22, vt2, vt3)   /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
    STORE_CACHE( c22, cptr2, cindx2 )
    VMRGLW(c23, vt2, vt3)   /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
    STORE_CACHE( c23, cptr2, cindx3 )

    VMRGHW(vt0, a03, a23)   /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
    VMRGLW(vt2, a03, a23)   /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
    VMRGHW(vt1, a13, a33)   /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
    VMRGLW(vt3, a13, a33)   /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */

    VMRGHW(c30, vt0, vt1)   /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
    STORE_CACHE( c30, cptr3, cindx0 )
    VMRGLW(c31, vt0, vt1)   /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
    STORE_CACHE( c31, cptr3, cindx1 )
    VMRGHW(c32, vt2, vt3)   /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
    STORE_CACHE( c32, cptr3, cindx2 )
    VMRGLW(c33, vt2, vt3)   /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
    STORE_CACHE( c33, cptr3, cindx3 )

    MR( outptr0, C )                    /* set for output loop in current pass */
    SLWI( Cbump, TCC, 6 )
    ADDI( A, A, 64 )
    ADD( C, C, Cbump )                  /* bump C for next pass */
    LI( icount, 64 )                    /* set icount for 2 tiles */
    BR( output_start )                  /* join to common output loop */



/**
    Top of input loop processes a 4 x 32 byte tile each trip
**/

LABEL( input_loop_do1 )
/* { */
    PREFETCH( dstp, dst_code, 0, TCA4 )

    VMRGHW(vt0, a00, a20)   /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
    LOAD_INPUT( a01, aptr0, aindx1 )
    VMRGLW(vt2, a00, a20)   /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
    LOAD_INPUT( a11, aptr1, aindx1 )
    VMRGHW(vt1, a10, a30)   /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
    LOAD_INPUT( a21, aptr2, aindx1 )
    VMRGLW(vt3, a10, a30)   /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
    LOAD_INPUT( a31, aptr3, aindx1 )

    VMRGHW(c00, vt0, vt1)   /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
    STORE_CACHE( c00, cptr0, cindx0 )
    VMRGLW(c01, vt0, vt1)   /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
    STORE_CACHE( c01, cptr0, cindx1 )
    VMRGHW(c02, vt2, vt3)   /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
    STORE_CACHE( c02, cptr0, cindx2 )
    VMRGLW(c03, vt2, vt3)   /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
    STORE_CACHE( c03, cptr0, cindx3 )

    BLE( flush_input_loop_do1 )

    ADD( aindx0, aindx0, TCA4 )         /* bump for next load sequence */
    ADD( aindx1, aindx1, TCA4 )

    VMRGHW(vt0, a01, a21)   /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
    LOAD_INPUT( a00, aptr0, aindx0 )    /*** begins next sequence ***/
    VMRGLW(vt2, a01, a21)   /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
    LOAD_INPUT( a10, aptr1, aindx0 )
    VMRGHW(vt1, a11, a31)   /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
    LOAD_INPUT( a20, aptr2, aindx0 )
    VMRGLW(vt3, a11, a31)   /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
    LOAD_INPUT( a30, aptr3, aindx0 )

    VMRGHW(c10, vt0, vt1)   /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
    STORE_CACHE( c10, cptr1, cindx0 )
    VMRGLW(c11, vt0, vt1)   /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
    STORE_CACHE( c11, cptr1, cindx1 )
    VMRGHW(c12, vt2, vt3)   /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
    STORE_CACHE( c12, cptr1, cindx2 )
    VMRGLW(c13, vt2, vt3)   /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
    STORE_CACHE( c13, cptr1, cindx3 )

    ADDI( cindx0, cindx0, 16 )          /* bump for next store sequence */
    ADDI( cindx1, cindx1, 16 )
    ADDI( cindx2, cindx2, 16 )
    ADDI( cindx3, cindx3, 16 )

    BR( input_loop_do1 )
LABEL( flush_input_loop_do1 )

    VMRGHW(vt0, a01, a21)   /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
    VMRGLW(vt2, a01, a21)   /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
    VMRGHW(vt1, a11, a31)   /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
    VMRGLW(vt3, a11, a31)   /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */

    VMRGHW(c10, vt0, vt1)   /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
    STORE_CACHE( c10, cptr1, cindx0 )
    VMRGLW(c11, vt0, vt1)   /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
    STORE_CACHE( c11, cptr1, cindx1 )
    VMRGHW(c12, vt2, vt3)   /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
    STORE_CACHE( c12, cptr1, cindx2 )
    VMRGLW(c13, vt2, vt3)   /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
    STORE_CACHE( c13, cptr1, cindx3 )

    MR( outptr0, C )                    /* set for output loop in current pass */
    SLWI( Cbump, TCC, 5 )
    ADDI( A, A, 32 )
    ADD( C, C, Cbump )                  /* bump C for next pass */
    LI( icount, 32 )                    /* set icount for 1 tile */
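
The input loops above rely on a two-step word merge: merging rows 0 and 2 and rows 1 and 3, then merging those intermediate results, yields the transpose of a 4 x 4 block of 32-bit words. The sketch below emulates the two AltiVec merge instructions in portable C to show that data flow; the vec4w type and the function names are hypothetical, and each "word" stands for four consecutive bytes of a 16-byte vector.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } vec4w;   /* one 16-byte vector as 4 words */

/* vmrghw: interleave the two high (first) words of a and b. */
static vec4w mrghw(vec4w a, vec4w b)
{
    vec4w r = { { a.w[0], b.w[0], a.w[1], b.w[1] } };
    return r;
}

/* vmrglw: interleave the two low (last) words of a and b. */
static vec4w mrglw(vec4w a, vec4w b)
{
    vec4w r = { { a.w[2], b.w[2], a.w[3], b.w[3] } };
    return r;
}

/* Transpose a 4 x 4 block of words with eight merges, mirroring the
 * vt0..vt3 / c00..c03 sequence in the listing. */
static void transpose4x4_words(const vec4w in[4], vec4w out[4])
{
    vec4w vt0 = mrghw(in[0], in[2]);
    vec4w vt2 = mrglw(in[0], in[2]);
    vec4w vt1 = mrghw(in[1], in[3]);
    vec4w vt3 = mrglw(in[1], in[3]);

    out[0] = mrghw(vt0, vt1);   /* word 0 of every input row */
    out[1] = mrglw(vt0, vt1);   /* word 1 of every input row */
    out[2] = mrghw(vt2, vt3);   /* word 2 of every input row */
    out[3] = mrglw(vt2, vt3);   /* word 3 of every input row */
}
```

The bytes inside each word are still in their original order after this step, which is why the listing follows it with a permute-based output stage.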



/**
    Second stage of transposition, write output
**/

LABEL( output_start )

    CMPW_CR( 6, icount, NC_left )

    MR( cptr, L1_cache )
    SLWI( TCC4, TCC, 2 )
    LI( cindx0, 0 )
    LI( cindx1, 16 )
    LI( cindx2, 2*16 )
    LI( cindx3, 3*16 )
    LI( cindx4, 4*16 )
    LI( cindx5, 5*16 )
    LI( cindx6, 6*16 )

    BLE_CR( 6, PC_OFFSET( 8 ) )
    MR( icount, NC_left )

    LI( cindx7, 7*16 )

    SUB( NC_left, NC_left, icount )

    ADDIC_C( icount, icount, -4 )
    LI( out_indx0, 0 )

    LOAD_CACHE( c0, cptr, cindx0 )
    ADD( out_indx1, out_indx0, TCC )
    LOAD_CACHE( c1, cptr, cindx1 )
    ADD( out_indx2, out_indx1, TCC )
    LOAD_CACHE( c2, cptr, cindx2 )
    ADD( out_indx3, out_indx2, TCC )
    LOAD_CACHE( c3, cptr, cindx3 )
    ADDI( outptr1, outptr0, 16 )
    LOAD_CACHE( c4, cptr, cindx4 )
    VPERM( vt0, c0, c1, vp0 )
    LOAD_CACHE( c5, cptr, cindx5 )
    VPERM( vt1, c0, c1, vp1 )
    LOAD_CACHE( c6, cptr, cindx6 )
    VPERM( vt2, c2, c3, vp0 )
    LOAD_CACHE( c7, cptr, cindx7 )
    VPERM( vt3, c2, c3, vp1 )
    ADDI( cptr, cptr, 128 )
    BR( output_mloop )



/**
    Loop outputs four 32 byte rows
**/

LABEL( output_loop )

    ADDIC_C( icount, icount, -4 )
    ADDI( cptr, cptr, 128 )

    STORE_OUTPUT( out0, outptr0, out_indx0 )
    VPERM( out4, vt4, vt6, vp2 )
    STORE_OUTPUT( out4, outptr1, out_indx0 )
    VPERM( out5, vt4, vt6, vp3 )
    STORE_OUTPUT( out1, outptr0, out_indx1 )
    VPERM( out6, vt5, vt7, vp2 )
    STORE_OUTPUT( out5, outptr1, out_indx1 )
    VPERM( out7, vt5, vt7, vp3 )

    STORE_OUTPUT( out2, outptr0, out_indx2 )
    VPERM( vt0, c0, c1, vp0 )
    STORE_OUTPUT( out6, outptr1, out_indx2 )
    VPERM( vt1, c0, c1, vp1 )
    STORE_OUTPUT( out3, outptr0, out_indx3 )
    VPERM( vt2, c2, c3, vp0 )
    STORE_OUTPUT( out7, outptr1, out_indx3 )
    VPERM( vt3, c2, c3, vp1 )

    ADD( outptr0, outptr0, TCC4 )
    ADD( outptr1, outptr1, TCC4 )

LABEL( output_mloop )

    BLE( flush_output_loop )

    LOAD_CACHE( c0, cptr, cindx0 )
    VPERM( vt4, c4, c5, vp0 )
    LOAD_CACHE( c1, cptr, cindx1 )
    VPERM( vt5, c4, c5, vp1 )
    LOAD_CACHE( c2, cptr, cindx2 )
    VPERM( vt6, c6, c7, vp0 )
    LOAD_CACHE( c3, cptr, cindx3 )
    VPERM( vt7, c6, c7, vp1 )
    LOAD_CACHE( c4, cptr, cindx4 )
    VPERM( out0, vt0, vt2, vp2 )
    LOAD_CACHE( c5, cptr, cindx5 )
    VPERM( out1, vt0, vt2, vp3 )
    LOAD_CACHE( c6, cptr, cindx6 )
    VPERM( out2, vt1, vt3, vp2 )
    LOAD_CACHE( c7, cptr, cindx7 )
    VPERM( out3, vt1, vt3, vp3 )

    BR( output_loop )

LABEL( flush_output_loop )

    VPERM( vt4, c4, c5, vp0 )
    VPERM( vt5, c4, c5, vp1 )
    VPERM( vt6, c6, c7, vp0 )
    VPERM( vt7, c6, c7, vp1 )

    CMPWI( icount, -3 )
    VPERM( out0, vt0, vt2, vp2 )
    STORE_OUTPUT( out0, outptr0, out_indx0 )
    VPERM( out4, vt4, vt6, vp2 )
    STORE_OUTPUT( out4, outptr1, out_indx0 )
    BEQ( oloop_next )

    CMPWI( icount, -2 )
    VPERM( out1, vt0, vt2, vp3 )
    STORE_OUTPUT( out1, outptr0, out_indx1 )
    VPERM( out5, vt4, vt6, vp3 )
    STORE_OUTPUT( out5, outptr1, out_indx1 )
    BEQ( oloop_next )

    CMPWI( icount, -1 )
    VPERM( out2, vt1, vt3, vp2 )
    STORE_OUTPUT( out2, outptr0, out_indx2 )
    VPERM( out6, vt5, vt7, vp2 )
    STORE_OUTPUT( out6, outptr1, out_indx2 )
    BEQ( oloop_next )

    VPERM( out3, vt1, vt3, vp3 )
    STORE_OUTPUT( out3, outptr0, out_indx3 )
    VPERM( out7, vt5, vt7, vp3 )
    STORE_OUTPUT( out7, outptr1, out_indx3 )

/**
    Next four rows of C?
**/

LABEL( oloop_next )

    BLT_CR( 6, outer_loop )             /* branch if icount < NC_left */

/**
    Exit routine
**/

LABEL( ret )

    FREE_THRU_v31( VRSAVE_COND )
    REST_r13_r28
    RETURN

FUNC_EPILOG
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MC Standard Algorithms PPC Macro language Version 



File Name:    mtriangle_8bit.mac

Description:  Move from an upper triangular matrix stored
              as a series of 32-line rectangles, each of
              width 32 elements less than its immediate
              predecessor, to the upper triangle of a
              full N x N matrix.

              mtriangle_8bit( char *A, char *C, int N )

Restrictions: A and C must both be 16-byte aligned.
              N must be a multiple of 16 and >= 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
0.0         000605    jg          Created



#include "salppc.inc"

#define LOAD_A( vT, rA, rB )     LVXL( vT, rA, rB )
#define LOAD_C( vT, rA, rB )     LVX( vT, rA, rB )
#define STORE_C( vS, rA, rB )    STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG    5
#define R_MATRIX_ALIGN        (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK   (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG     4
#define ALTIVEC_ALIGN         (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK    (ALTIVEC_ALIGN - 1)

/**
    Input parameters
**/

#define A           r3
#define C           r4
#define N           r5
#define A_tcols     r6
#define C_tcols     r7
#define altivec_N   r8
#define A_counter   r9
#define index0      r10
#define index1      r11
#define index2      r12
#define index3      r13
#define count       r0

#define a0          v0
#define a1          v1
#define a2          v2
#define a3          v3
#define c0          v4
#define shift       v5
#define shift_incr  v6
#define mask        v7
#define left        v8
#define right       v9
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FUNC_PROLOG 

ENTRY_3( mtriangle_8bit, A, C, N )
SAVE_r13

USE_THRU_v9( VRSAVE_COND )

    ADDI( A_counter, N, R_MATRIX_ALIGN_MASK )
    VSPLTISW( shift_incr, 8 )
    ADDI( altivec_N, N, ALTIVEC_ALIGN_MASK )
    VXOR( shift, shift, shift )
    RLWINM( A_counter, A_counter, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
    RLWINM( altivec_N, altivec_N, 0, 0, (31 - ALTIVEC_ALIGN_LOG) )
    ADDI( C_tcols, A_counter, 1 )

LABEL( oloop )

    ADDIC_C( count, altivec_N, -64 )
    LOAD_C( c0, 0, C )
    VSPLTISW( mask, -1 )
    LOAD_A( a0, 0, A )
    VSRO( mask, mask, shift )
    LI( index0, 16 )
    VANDC( left, c0, mask )
    LI( index1, 32 )
    VAND( right, a0, mask )
    LI( index2, 48 )
    VOR( c0, left, right )
    STORE_C( c0, 0, C )
    BLE( dosmall )
    LI( index3, 64 )

LABEL( iloop )

    LOAD_A( a0, A, index0 )
    ADDIC_C( count, count, -64 )
    LOAD_A( a1, A, index1 )
    LOAD_A( a2, A, index2 )
    LOAD_A( a3, A, index3 )
    STORE_C( a0, C, index0 )
    ADDI( index0, index0, 64 )
    STORE_C( a1, C, index1 )
    ADDI( index1, index1, 64 )
    STORE_C( a2, C, index2 )
    ADDI( index2, index2, 64 )
    STORE_C( a3, C, index3 )
    ADDI( index3, index3, 64 )
    BGT( iloop )

LABEL( dosmall )

    ADDIC_C( count, count, 48 )
    BLE( windout )

LABEL( sloop )

    ADDIC_C( count, count, -16 )
    LOAD_A( a0, A, index0 )
    STORE_C( a0, C, index0 )
    ADDI( index0, index0, 16 )
    BGT( sloop )

LABEL( windout )

    DECR_C( N )
    VADDUWM( shift, shift, shift_incr )
    ADDI( A_counter, A_counter, -1 )
    ADDI( A, A, 1 )
    ADDI( A_tcols, A_counter, R_MATRIX_ALIGN_MASK )
    DECR( altivec_N )
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RLWINM ( A tCOls, A_tCOls, 0, 0, (31 - R_MATRIX^_ALIGN_LOG ) ) 
    ADD( C, C, C_tcols )
    ADD( A, A, A_tcols )
    BNE( oloop )

    FREE_THRU_v9( VRSAVE_COND )
    REST_r13
    RETURN

FUNC_EPILOG
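
At the top of each row, the routine above merges the first 16-byte vector of the destination with the first vector of the source under a byte mask (VANDC, VAND, VOR), so that bytes lying before the diagonal in that vector keep their existing destination values. The sketch below shows that merge in portable C under stated assumptions: merge_first_vector is a hypothetical name, and the number of preserved leading bytes is taken as row modulo 16, on the reading that the shift register advances by 8 bits per row and that the octet shift consumes a 4-bit byte count.

```c
#include <assert.h>
#include <stdint.h>

/* Merge the first vector of a row: keep the leading (row % 16) bytes of
 * the destination c, take the remaining bytes from the source a. */
static void merge_first_vector(uint8_t c[16], const uint8_t a[16], unsigned row)
{
    unsigned keep = row & 15u;   /* assumption: 4-bit byte count */

    for (unsigned i = 0; i < 16; i++) {
        /* mask is all-ones shifted right by `keep` bytes:
         * 0x00 for the leading bytes, 0xff for the rest */
        uint8_t mask  = (i < keep) ? 0x00 : 0xff;
        uint8_t left  = (uint8_t)(c[i] & (uint8_t)~mask);   /* VANDC */
        uint8_t right = (uint8_t)(a[i] & mask);             /* VAND  */
        c[i] = (uint8_t)(left | right);                     /* VOR   */
    }
}
```

For row 0 the whole vector comes from the source; for row 3 the first three destination bytes survive and the other thirteen are overwritten.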
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#if !defined( SALPPC_H ) 
#define SALPPC_H

#if 0
********************************************************************

    MC Standard Algorithms -- PPC Version

********************************************************************

    File Name:      salppc.h

    Description:    SAL macro include file

    Source files should have the extension .mac (for example,
    vadd.mac) and must include this file (salppc.h).

    To assemble for PPC ucode, use the following basic
    makefile build rule:

        .SUFFIXES: .mac .c .s .o

        .mac.o:
                cp $*.mac $*.c
                ccmc -o $*.s -E $*.c
                ccmc -c -o $*.o $*.s
                rm -f $*.s
                rm -f $*.c

    To compile for C, use the following basic makefile build rule:

        .SUFFIXES: .mac .c .o

        .mac.o:
                cp $*.mac $*.c
                ccmc -DCOMPILE_C -c -o $*.o $*.c
                rm -f $*.c

    The first 8 function arguments are passed in GPR registers
    r3 - r10. Arguments beyond 8 are passed on the stack and may
    be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
    Additional GPR registers should be assigned in ascending order
    starting from the last function argument. These may be declared
    with the DECLARE_rx[_ry] macros. For example, a function with
    5 arguments that requires 3 additional GPR registers would
    issue: DECLARE_r8_r10. r0, if required, should be declared
    separately with the DECLARE_r0 macro. GPR registers above r12
    must be saved and restored using the SAVE_r13[_ry] and
    REST_r13[_ry] macros, respectively.

    FPR registers should be assigned in ascending order starting
    with f0[d0]. These may be declared with the DECLARE_f0[_fy]
    or DECLARE_d0[_dy] macros.

    For example, DECLARE_f0_f11. FPR registers above f13[d13] must
    be saved and restored using the SAVE_f14[_fy] and REST_f14[_fy]
    or SAVE_d14[_dy] and REST_d14[_dy] macros, respectively.

    All variables must be assigned a register using the
    pre-processor #define directive. GPR registers are named
    r0 - r31. Single precision FPR registers are named f0 - f31.
    Double precision FPR registers are named d0 - d31. Different
    variables may be assigned to the same register as in:

        #define vara    f12
        #define varb    f12

    Functions must begin with the FUNC_PROLOG macro and end
    with the FUNC_EPILOG macro.
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* 


Macros are 


provided for both Fortran and C entry points. 


+ 

*• 




The GET SALCACHE macro should be used to get the address of 


+ 


* 
* 


the "current" 


salcache buffer into a GPR register. 


* 


* 


Avoid terminating macro lines with a semicolon. 


* 
* 


★ 
* 


The following example demonstrates typical usage: 


•* 
* 


* 
* 


    #include "salppc.h"

    /*
     * assign variables to registers
     */
    #define A       r3
    #define I       r4
    #define B       r5
    #define J       r6
    #define C       r7
    #define K       r8
    #define D       r9
    #define L       r10
    #define N       r12
    #define EFLAG   r11

    #define count   r11

    #define t0      r13
    #define t1      r13
    #define t2      r14
    #define t3      r14
    #define t4      r15
    #define t5      r15
    #define t6      r16

    #define a0      f0
    #define a1      f1
    #define a2      f2
    #define a3      f3
    #define b0      f4
    #define b1      f5
    #define b2      f6
    #define b3      f7
    #define c0      f8
    #define c1      f9
    #define c2      f10
    #define c3      f11
    #define d0      f12
    #define d1      f13
    #define d2      f14
    #define d3      f15

    FUNC_PROLOG                     /* must precede function */

    #if !defined( COMPILE_C )
    U_ENTRY(foo_)
        FORTRAN_DREF_4(I, J, K, L)
        FORTRAN_DREF_ARG8
    U_ENTRY(foo)
        LI(EFLAG, 0)
        BR(common)

    U_ENTRY(foox_)
        FORTRAN_DREF_4(I, J, K, L)
        FORTRAN_DREF_ARG8
        FORTRAN_DREF_ARG9
    #endif
    ENTRY_10(foox, A, I, B, J, C, K, D, L, N, EFLAG)
        DECLARE_r13_r16
        DECLARE_f0_f15
        GET_ARG9( EFLAG )           /* get the 9'th arg (EFLAG) off stack */

    LABEL( common )

        SAVE_CR                     /* needed if using fields 2, 3 or 4 */
        SAVE_r13_r16
        SAVE_f14_f15
        SAVE_LR                     /* needed if making a function call */

        GET_ARG8( N )               /* get the 8'th arg (N) off stack */

            body of function

        REST_CR
        REST_r13_r16
        REST_f14_f15
        REST_LR
        RETURN

    FUNC_EPILOG                     /* must conclude function */

    Mercury Computer Systems, Inc.
    Copyright (c) 1996 All rights reserved

    Revision    Date      Engineer; Reason
    --------    ------    ----------------
    0.0         960223    jg;  Created
    0.1         970109    jfk; Added POSTING_BUFFER_COUNT and made
                               TEST_IF_DCBZ macro time "stw" instead
                               of doing the TEST_IF_DCBT macro (lwz)
    0.2         970124    jfk; Added SALCACHE_ALLOC_SIZE,
                               ALIGN_SALCACHE, CREATE_SALCACHE_FRAME,
                               DESTROY_SALCACHE_FRAME
    0.3         970521    jfk; Added SET_DCB[TZ]_COND macros.
                               Made old macros not assemble
    0.4         980813    jfk; Changes SALCACHE_ALLOC_SIZE for 750

**********************************************************************

#endif /* header */ 
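
The register-assignment rule in the header comment can be restated as a small calculation. The following is only a sketch: `declare_range` and `must_save` are hypothetical helpers, not part of salppc.h, and they assume that arguments occupy consecutive GPRs starting at r3 (the two worked cases in the header comment, 5 arguments giving r8 to r10 and the 10-argument example using r13 to r16, are consistent with that reading).

```c
#include <assert.h>

enum { FIRST_ARG_GPR = 3, LAST_UNSAVED_GPR = 12 };

/* Hypothetical helper: first and last register of the DECLARE_rX_rY
 * range for a routine with `nargs` arguments that needs `nextra`
 * additional GPRs, assigned in ascending order after the last argument. */
static void declare_range(int nargs, int nextra, int *first, int *last)
{
    assert(nargs >= 0 && nextra >= 1);
    *first = FIRST_ARG_GPR + nargs;
    *last = *first + nextra - 1;
    assert(*last <= 31);            /* only r0 - r31 exist */
}

/* GPRs above r12 must be bracketed by SAVE_r13[_ry] / REST_r13[_ry]. */
static int must_save(int gpr)
{
    return gpr > LAST_UNSAVED_GPR;
}
```

With 5 arguments and 3 extra registers this yields r8 through r10, none of which needs saving; with 10 arguments and 4 extra registers it yields r13 through r16, all of which do.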



#include <math.h> 

#define uchar unsigned char 
#define ulong unsigned long
#define ushort unsigned short

#define CR _cr
#define CTR _ctr
#define VSCR _vscr

/*
 * define a structure to represent a VMX register
 */
typedef union {
    char c[16];
    uchar uc[16];
    short s[8];
    ushort us[8];
    long l[4];
    ulong ul[4];
    float f[4];
} VMX_reg;

#define FUNC_PROLOG
#define FUNC_EPILOG \
}

#define TEXT_SECTION( logb2_align )
#define DATA_SECTION( logb2_align )
#define RODATA_SECTION( logb2_align )
/* 

* macro for C extern declarations 
*/ 

#define EXTERN_DATA( symbol ) \
extern long symbol;

#define EXTERN_FUNC( func ) \
extern void func( void );

/*
 * macro for a global declaration
 */
#define GLOBAL( symbol )

/*
 * macro for a local declaration
 */
#define LOCAL( symbol )

/*
 * macros for creating static arrays
 */
#define START_ARRAY( type, name ) \
type name##[] = {

#define START_C_ARRAY( name )  START_ARRAY( char, name )
#define START_UC_ARRAY( name ) START_ARRAY( uchar, name )
#define START_S_ARRAY( name )  START_ARRAY( short, name )
#define START_US_ARRAY( name ) START_ARRAY( ushort, name )
#define START_L_ARRAY( name )  START_ARRAY( long, name )
#define START_UL_ARRAY( name ) START_ARRAY( ulong, name )
#define START_F_ARRAY( name )  START_ARRAY( float, name )

#define END_ARRAY \
};

#define DATA( d1 ) \
d1,

#define DATA2( d1, d2 ) \
d1, d2,

#define DATA4( d1, d2, d3, d4 ) \
d1, d2, d3, d4,

#define DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
d1, d2, d3, d4, d5, d6, d7, d8,


#define C_DATA( d1 )    DATA( d1 )
#define UC_DATA( d1 )   DATA( d1 )
#define S_DATA( d1 )    DATA( d1 )
#define US_DATA( d1 )   DATA( d1 )
#define L_DATA( d1 )    DATA( d1 )
#define UL_DATA( d1 )   DATA( d1 )
#define F_DATA( d1 )    DATA( d1 )



#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 )    DATA2( d2, d1 )
#else
#define D_DATA( d1, d2 )    DATA2( d1, d2 )
#endif

#define C_DATA2( d1, d2 )   DATA2( d1, d2 )
#define UC_DATA2( d1, d2 )  DATA2( d1, d2 )
#define S_DATA2( d1, d2 )   DATA2( d1, d2 )
#define US_DATA2( d1, d2 )  DATA2( d1, d2 )
#define L_DATA2( d1, d2 )   DATA2( d1, d2 )
#define UL_DATA2( d1, d2 )  DATA2( d1, d2 )
#define F_DATA2( d1, d2 )   DATA2( d1, d2 )

#define C_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define F_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
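
As an illustration of how the array and data macros combine in C mode, the sketch below builds a static table of longs. The macro definitions are simplified stand-ins written for this example (in particular, `START_ARRAY` here omits the `##` paste, which current compilers reject before `[`), and the table name `offsets` and the accessor `offset_at` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the header's table macros. */
#define START_ARRAY(type, name)     type name[] = {
#define END_ARRAY                   };
#define DATA4(d1, d2, d3, d4)       d1, d2, d3, d4,
#define START_L_ARRAY(name)         START_ARRAY(long, name)
#define L_DATA4(d1, d2, d3, d4)     DATA4(d1, d2, d3, d4)

/* Note: no semicolons after the macro lines, as the header advises. */
START_L_ARRAY(offsets)
    L_DATA4( 0, 16, 32, 48 )
    L_DATA4( 64, 80, 96, 112 )
END_ARRAY

/* Bounds-checked accessor, present only to exercise the table. */
static long offset_at(size_t i)
{
    assert(i < sizeof offsets / sizeof offsets[0]);
    return offsets[i];
}
```

Each `DATA` macro ends its expansion with a comma, so rows can be stacked freely; C permits the trailing comma before the closing brace that `END_ARRAY` supplies.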
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/* 

* macros for creating vmx permute masks (128-bits) 
*/ 

#if defined( LITTLE_ENDIAN )

#define L_PERMUTE_MUNGE( l )    ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s )    ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c )    ( (c) ^ 0x1f )
#define L_INDEX_MUNGE( x )      ( (x) ^ 0x3 )
#define S_INDEX_MUNGE( x )      ( (x) ^ 0x7 )
#define C_INDEX_MUNGE( x )      ( (x) ^ 0xf )

#else

#define L_PERMUTE_MUNGE( l )    ( l )
#define S_PERMUTE_MUNGE( s )    ( s )
#define C_PERMUTE_MUNGE( c )    ( c )
#define L_INDEX_MUNGE( x )      ( x )
#define S_INDEX_MUNGE( x )      ( x )
#define C_INDEX_MUNGE( x )      ( x )

#endif
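
The little-endian munge macros pair each index with a constant mask. The operator between them did not survive in the listing; the sketch below assumes it is exclusive-or, which is the usual way to mirror an index within a power-of-two range. The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* ASSUMPTION: the munge operator is XOR. */

/* Byte-level permute index (0..31, two 16-byte sources): mirror it. */
static unsigned c_permute_munge_le(unsigned c)
{
    assert(c < 32u);
    return c ^ 0x1fu;
}

/* Four packed byte indices describing one 32-bit word: XOR with 0x1c in
 * each byte mirrors the word position and keeps the byte-in-word bits. */
static unsigned long l_permute_munge_le(unsigned long l)
{
    return l ^ 0x1c1c1c1cUL;
}
```

Under this assumption the transform is its own inverse, so applying it twice returns the original index.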

#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
    L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
    L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 ),

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
    S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
    S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
    S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
    S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 ),

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
    C_PERMUTE_MUNGE( c1 ), C_PERMUTE_MUNGE( c2 ), \
    C_PERMUTE_MUNGE( c3 ), C_PERMUTE_MUNGE( c4 ), \
    C_PERMUTE_MUNGE( c5 ), C_PERMUTE_MUNGE( c6 ), \
    C_PERMUTE_MUNGE( c7 ), C_PERMUTE_MUNGE( c8 ), \
    C_PERMUTE_MUNGE( c9 ), C_PERMUTE_MUNGE( c10 ), \
    C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
    C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
    C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 ),

/* 

* macro for a microcode entry point (e.g. vaddx, vaddx_) 

 * U_ENTRY is a "nop" for C code
 */
#define U_ENTRY( func_name )
/* 

* macros for C function prototypes 
*/ 

#define C_PROTOTYPE_0( func_name ) \
void func_name( void );

#define C_PROTOTYPE_1( func_name ) \
void func_name( long );

#define C_PROTOTYPE_2( func_name ) \
void func_name( long, long );

#define C_PROTOTYPE_3( func_name ) \
void func_name( long, long, long );

#define C_PROTOTYPE_4( func_name ) \
void func_name( long, long, long, long );

#define C_PROTOTYPE_5( func_name ) \
void func_name( long, long, long, long, long );

#define C_PROTOTYPE_6( func_name ) \
void func_name( long, long, long, long, long, long );

#define C_PROTOTYPE_7( func_name ) \
void func_name( long, long, long, long, long, long, long );

#define C_PROTOTYPE_8( func_name ) \
void func_name( long, long, long, long, long, long, long, long );

#define C_PROTOTYPE_9( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long );

#define C_PROTOTYPE_10( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long, long );

#define C_PROTOTYPE_11( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long, long, long );

#define C_PROTOTYPE_12( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long, long, long, long );

#define C_PROTOTYPE_13( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long, long, long, long, long );

#define C_PROTOTYPE_14( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long, long, long, long, long, long );

#define C_PROTOTYPE_15( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long, long, long, long, long, long, long );

#define C_PROTOTYPE_16( func_name ) \
void func_name( long, long, long, long, long, long, long, long, \
                long, long, long, long, long, long, long, long );

#define AUTO_r3_r31 \
    long r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r4_r31 \
    long r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r5_r31 \
    long r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r6_r31 \
    long r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r7_r31 \
    long r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r8_r31 \
    long r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r9_r31 \
    long r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r10_r31 \
    long r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r11_r31 \
    long r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r12_r31 \
    long r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, \
         r31;

#define AUTO_r13_r31 \
    long r13, r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r14_r31 \
    long r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r15_r31 \
    long r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r16_r31 \
    long r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r17_r31 \
    long r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r18_r31 \
    long r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r19_r31 \
    long r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_f0_f31 \
    float f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, \
          f15, f16, f17, f18, f19, f20, f21, f22, f23, f24, f25, f26, f27, \
          f28, f29, f30, f31;

#define AUTO_d0_d31 \
    double d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, \
           d15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, \
           d28, d29, d30, d31;

#if defined( BUILD_MAX )

#define AUTO_v0_v31 \
    VMX_reg v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, \
            v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, \
            v28, v29, v30, v31;

#endif

/*
 * For C implementation, create a dummy stack on function entry of size
 * 4096.
 */
#define STACK_SIZE 4096
/* 

* macros for C and Fortran callable entry points 
*/ 

#define ENTRY_0( func_name ) \
C_PROTOTYPE_0( func_name ) \
void func_name( void ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r3_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_1( func_name, arg0 ) \
C_PROTOTYPE_1( func_name ) \
void func_name( long arg0 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r4_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_2( func_name, arg0, arg1 ) \
C_PROTOTYPE_2( func_name ) \
void func_name( long arg0, long arg1 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r5_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
C_PROTOTYPE_3( func_name ) \
void func_name( long arg0, long arg1, long arg2 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r6_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
C_PROTOTYPE_4( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r7_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
C_PROTOTYPE_5( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r8_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
C_PROTOTYPE_6( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r9_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6 ) \
C_PROTOTYPE_7( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r10_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7 ) \
C_PROTOTYPE_8( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r11_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7, arg8 ) \
C_PROTOTYPE_9( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r12_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9 ) \
C_PROTOTYPE_10( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r13_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;


#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10 ) \
C_PROTOTYPE_11( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r14_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11 ) \
C_PROTOTYPE_12( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r15_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12 ) \
C_PROTOTYPE_13( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r16_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13 ) \
C_PROTOTYPE_14( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r17_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14 ) \
C_PROTOTYPE_15( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13, \
                long arg14 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r18_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14, arg15 ) \
C_PROTOTYPE_16( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13, \
                long arg14, long arg15 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r19_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;
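
The ENTRY macros turn a .mac routine into an ordinary C function whose parameters and locals are named after PowerPC registers, so the same instruction-macro text can be compiled as C. The sketch below shows the idea at reduced scale. Every macro body in it is an assumption made for illustration (the instruction-macro bodies are not legible in this listing), the routine `sum_to_n` and the wrapper `sum_to_n_checked` are hypothetical, and the pointer-in-a-long convention assumes `long` is wide enough to hold an address.

```c
#include <assert.h>

/* Hypothetical, cut-down stand-ins for the header's C-mode macros. */
#define AUTO_r5                 long r5 = 0;
#define ENTRY_2(func_name, arg0, arg1)      \
    void func_name(long arg0, long arg1)    \
    {                                       \
        long CR[8] = {0};                   \
        AUTO_r5
#define FUNC_EPILOG             }
#define LABEL(name)             name:
#define LI(rD, simm)            (rD) = (simm);
#define ADD(rD, rA, rB)         (rD) = (rA) + (rB);
#define ADDIC_C(rD, rA, simm)   (rD) = (rA) + (simm); CR[0] = (long)(rD);
#define BGT(label)              if (CR[0] > 0) goto label;
#define STW(rS, d, rA)          *(long *)((rA) + (d)) = (rS);
#define RETURN                  return;

/* variables are assigned to "registers" with #define */
#define OUTP    r3
#define NVAL    r4
#define acc     r5

/* Stores NVAL + (NVAL-1) + ... + 1 at the address held in OUTP. */
ENTRY_2(sum_to_n, OUTP, NVAL)
    LI(acc, 0)
LABEL(loop)
    ADD(acc, acc, NVAL)
    ADDIC_C(NVAL, NVAL, -1)
    BGT(loop)
    STW(acc, 0, OUTP)
    RETURN
FUNC_EPILOG

/* Convenience wrapper, not part of the header. */
static long sum_to_n_checked(long n)
{
    long out = 0;
    assert(n >= 1);                 /* the loop body runs at least once */
    sum_to_n((long)&out, n);
    return out;
}
```

Because the macro arguments are themselves `#define`d to register names, `ENTRY_2(sum_to_n, OUTP, NVAL)` expands to a function whose parameters are literally `r3` and `r4`, which is why each ENTRY_n only has to declare the registers above its last argument.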



/*
 * macros to get GPR arguments beyond 8
 */
#define GET_ARG8( rD )
#define GET_ARG9( rD )
#define GET_ARG10( rD )
#define GET_ARG11( rD )
#define GET_ARG12( rD )
#define GET_ARG13( rD )
#define GET_ARG14( rD )
#define GET_ARG15( rD )
#define GET_ARG16( rD )
#define GET_ARG17( rD )

/*
 * macros to set GPR arguments beyond 8
 */
#define SET_ARG8( rD )
#define SET_ARG9( rD )
#define SET_ARG10( rD )
#define SET_ARG11( rD )
#define SET_ARG12( rD )
#define SET_ARG13( rD )
#define SET_ARG14( rD )
#define SET_ARG15( rD )
#define SET_ARG16( rD )
#define SET_ARG17( rD )

/* 

* macro to branch from one entry point to another 

 */
#define BR_FUNC( func_name ) \
    func_name();

/*
 * macros to call functions
 */
#define CALL_FUNC( func_name ) \
    func_name();

/*
 * macros to call functions
 */
#define CALL_0( func_name ) \
    func_name();

#define CALL_1( func_name, arg0 ) \
    func_name( arg0 );

#define CALL_2( func_name, arg0, arg1 ) \
    func_name( arg0, arg1 );

#define CALL_3( func_name, arg0, arg1, arg2 ) \
    func_name( arg0, arg1, arg2 );

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
    func_name( arg0, arg1, arg2, arg3 );

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    func_name( arg0, arg1, arg2, arg3, arg4 );

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5 );

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6 );

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 );

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                arg8 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8 );

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9 );

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10 );

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11 );

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12 );

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12, arg13 );

#define CALL_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12, arg13, arg14 );

#define CALL_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 );

#if defined ( BUILD_MAX ) 



/*
 * G4 macros to create a dummy jump table.
 * (not supported in C)
 */
#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )
#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )

/*
 * G4 macros to decide whether to enter a VMX loop
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note, a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (not supported in C)
 */

#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag )
#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, n, eflag )
#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, ps, s2, n, eflag )
#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, pc, n, eflag )
#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag )
#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, \
                              eflag )

#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, n, eflag )
#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag )
#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, n, eflag )
#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, n, eflag )
#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, pr5, pi5, s5, n, eflag )
#define BR_IF_VMX_CONV( root_name, min_n_imm, \
                        p1, s1, s2, p3, s3, n, eflag )
#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
                         pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag )
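
The BR_IF_VMX macros expand to nothing in C, but the test their comment describes is easy to state. The following sketch encodes the three stated conditions (minimum count, identical low 4 address bits, unit strides); `can_use_vmx` and its parameter layout are hypothetical and are not the assembly implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when a VMX loop may be entered for `nvec` vectors:
 *  - at least min_n elements,
 *  - every stride equals unit_stride (1, or 2 for packed complex),
 *  - every pointer has the same offset within a 16-byte line. */
static int can_use_vmx(long n, long min_n, long unit_stride,
                       const uintptr_t *ptrs, const long *strides, int nvec)
{
    int i;

    assert(nvec >= 1);
    if (n < min_n)
        return 0;
    for (i = 0; i < nvec; i++) {
        if (strides[i] != unit_stride)
            return 0;
        if ((ptrs[i] & 15u) != (ptrs[0] & 15u))
            return 0;
    }
    return 1;
}
```

Requiring equal relative alignment, rather than full 16-byte alignment, lets a single scalar prologue bring all vectors to a boundary at the same time.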

/* 

 * G4 macro to get VMX unaligned (FP) count
 * assumes all vectors have the same relative alignment
 * and that the last 2 bits of ptr are 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = ((count) >> 2) & 3; \
    CR[0] = (long)(count); \
}
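
The arithmetic in GET_VMX_UNALIGNED_COUNT gives the number of 4-byte elements that must be handled before the pointer reaches a 16-byte boundary. A minimal sketch of the same computation as a function follows; the name `vmx_unaligned_count` is hypothetical, and unsigned arithmetic is used so the negation and shift are well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Elements of 4 bytes to process before addr is 16-byte aligned. */
static unsigned vmx_unaligned_count(uintptr_t addr)
{
    assert((addr & 3u) == 0);       /* macro's precondition: low 2 bits 0 */
    return (unsigned)((((uintptr_t)0 - addr) >> 2) & 3u);
}
```

For example, an address ending in 0x4 needs three scalar elements (at offsets 0x4, 0x8 and 0xc) before the vector loop can start, while an address ending in 0x0 needs none.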

/* 

 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = ((count) >> 1) & 7; \
    CR[0] = (long)(count); \
}

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = (count) & 15; \
    CR[0] = (long)(count); \
}

/* 

 * G4 macro to load and splat an FP scalar independent of alignment
 */
#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    (vt).f[0] = (vt).f[1] = (vt).f[2] = (vt).f[3] = *scalarp;

#endif /* end BUILD_MAX */
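
In C mode a vector register is just a union, so a "splat" reduces to four float stores. The sketch below uses a reduced, hypothetical union (`vmx_reg_sketch`) rather than the header's full `VMX_reg`, and a function in place of the macro.

```c
#include <assert.h>

/* Reduced stand-in for VMX_reg: 16 bytes viewed as bytes or 4 floats. */
typedef union {
    unsigned char uc[16];
    float f[4];
} vmx_reg_sketch;

/* Copy one scalar into all four float lanes of the register image. */
static void scalar_splat(vmx_reg_sketch *vt, const float *scalarp)
{
    assert(vt != 0 && scalarp != 0);
    vt->f[0] = vt->f[1] = vt->f[2] = vt->f[3] = *scalarp;
}
```

The C version has no alignment concerns because it reads the scalar through an ordinary float pointer; the `vtmp` scratch argument of the macro is only needed by the assembly form.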

/* 

 * cache (DCBT and DCBZ) macros.
 */
#define DCBT_TRUE( cond_bit, scratch ) \
    CR[(cond_bit)] = -1;    /* true (<= 0) */

#define DCBZ_TRUE( cond_bit, scratch ) \
    DCBT_TRUE( cond_bit, scratch )

#define DCBT_FALSE( cond_bit, scratch ) \
    CR[(cond_bit)] = 1;     /* false (> 0) */

#define DCBZ_FALSE( cond_bit, scratch ) \
    DCBT_FALSE( cond_bit, scratch )

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
    CR[(cond_bit)] = (eflag & (cache_bit));

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer_stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    CR[(cond_bit)] = (eflag & (cache_bit));

#define DCBT_IF( cond_bit, rA, rB ) \
    if ( CR[(cond_bit)] <= 0 ) \
    { DCBT( rA, rB ) }

#define DCBZ_IF( cond_bit, rA, rB ) \
    if ( CR[(cond_bit)] <= 0 ) \
    { DCBZ( rA, rB ) }

#define DCBT_IF_CACHABLE( cond_bit, rA, rB ) \
    DCBT_IF( cond_bit, rA, rB )

#define DCBZ_IF_CACHABLE( cond_bit, rA, rB ) \
    DCBZ_IF( cond_bit, rA, rB )

#define BR_IF_CACHABLE( cond_bit, label ) \
    if ( CR[(cond_bit)] <= 0 ) \
        goto label;

#define BR_IF_NOT_CACHABLE( cond_bit, label ) \
    if ( CR[(cond_bit)] > 0 ) \
        goto label;

/*
 * ASIC macros
 */

#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
    *(volatile long *)PREFETCH_CONTROL = (mode);

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
    *(volatile long *)MISCON_B = (mode);

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
{ \
    volatile long i; \
    i = *(volatile long *)MISCON_B; \
    i &= PREFETCH_MASK; \
    i |= USE_PREFETCH_CONTROL; \
    *(volatile long *)PREFETCH_CONTROL = i; \
}

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )
#define LOAD_MISCON_B( mode, scratch1, scratch2 )
#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif
/*
 * instruction macros
 */

#define ADD( rD, rA, rB )            (rD) = (rA) + (rB);
#define ADD_C( rD, rA, rB )          (rD) = (rA) + (rB); CR[0] = (long)(rD);
#define ADDI( rD, rA, SIMM )         (rD) = (rA) + (SIMM);
#define ADDIC_C( rD, rA, SIMM )      (rD) = (rA) + (SIMM); CR[0] = (long)(rD);
#define ADDIS( rD, rA, SIMM )        (rD) = (rA) + ((SIMM) << 16);
#define AND( rA, rS, rB )            (rA) = (rS) & (rB);
#define AND_C( rA, rS, rB )          (rA) = (rS) & (rB); CR[0] = (long)(rA);
#define ANDC( rA, rS, rB )           (rA) = (rS) & ~(rB);
#define ANDC_C( rA, rS, rB )         (rA) = (rS) & ~(rB); CR[0] = (long)(rA);
#define ANDI_C( rA, rS, UIMM )       (rA) = (rS) & (UIMM); CR[0] = (long)(rA);
#define ANDIS_C( rA, rS, UIMM )      (rA) = (rS) & ((UIMM) << 16); \
                                     CR[0] = (long)(rA);
#define BA( addr )                   goto (addr);
#define BCTR                         (*(void (*)(void))CTR)();
#define BEQ( label )                 if ( CR[0] == 0 ) goto label;
#define BEQ_PLUS( label )            BEQ( label )
#define BEQ_MINUS( label )           BEQ( label )
#define BEQ_CR( bit, label )         if ( CR[(bit)] == 0 ) goto label;
#define BEQ_CR_PLUS( bit, label )    BEQ_CR( bit, label )
#define BEQ_CR_MINUS( bit, label )   BEQ_CR( bit, label )
#define BEQLR                        if ( CR[0] == 0 ) return;
#define BEQLR_PLUS                   BEQLR
#define BEQLR_MINUS                  BEQLR
#define BEQLR_CR( bit )              if ( CR[(bit)] == 0 ) return;
#define BEQLR_CR_PLUS( bit )         BEQLR_CR( bit )
#define BEQLR_CR_MINUS( bit )        BEQLR_CR( bit )
#define BGE( label )                 if ( CR[0] >= 0 ) goto label;
#define BGE_PLUS( label )            BGE( label )
#define BGE_MINUS( label )           BGE( label )
#define BGE_CR( bit, label )         if ( CR[(bit)] >= 0 ) goto label;
#define BGE_CR_PLUS( bit, label )    BGE_CR( bit, label )
#define BGE_CR_MINUS( bit, label )   BGE_CR( bit, label )
#define BGELR                        if ( CR[0] >= 0 ) return;
#define BGELR_PLUS                   BGELR
#define BGELR_MINUS                  BGELR
#define BGELR_CR( bit )              if ( CR[(bit)] >= 0 ) return;
#define BGELR_CR_PLUS( bit )         BGELR_CR( bit )
#define BGELR_CR_MINUS( bit )        BGELR_CR( bit )
#define BGT( label )                 if ( CR[0] > 0 ) goto label;
#define BGT_PLUS( label )            BGT( label )
#define BGT_MINUS( label )           BGT( label )
#define BGT_CR( bit, label )         if ( CR[(bit)] > 0 ) goto label;
#define BGT_CR_PLUS( bit, label )    BGT_CR( bit, label )
#define BGT_CR_MINUS( bit, label )   BGT_CR( bit, label )
#define BGTLR                        if ( CR[0] > 0 ) return;
#define BGTLR_PLUS                   BGTLR
#define BGTLR_MINUS                  BGTLR
#define BGTLR_CR( bit )              if ( CR[(bit)] > 0 ) return;
#define BGTLR_CR_PLUS( bit )         BGTLR_CR( bit )
#define BGTLR_CR_MINUS( bit )        BGTLR_CR( bit )
#define BL( func_name )              func_name();
#define BLE( label )                 if ( CR[0] <= 0 ) goto label;
#define BLE_PLUS( label )            BLE( label )
#define BLE_MINUS( label )           BLE( label )
#define BLE_CR( bit, label )         if ( CR[(bit)] <= 0 ) goto label;
#define BLE_CR_PLUS( bit, label )    BLE_CR( bit, label )
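
The macros above rely on one convention: a record-form instruction (the `_C` variants) copies its signed result into `CR[0]`, and each conditional branch then tests the sign of that entry. The sketch below illustrates the convention with two of the macros in a decrement-and-branch loop. The `CR` array declaration, the `do { } while (0)` wrappers, and the `sum_down` function are assumptions added for illustration and do not appear in the original listing.

```c
/* Hypothetical stand-in for the emulated condition register fields. */
static long CR[8];

/* Record-form add immediate: store the sum and mirror it into CR[0]. */
#define ADDIC_C(rD, rA, SIMM) \
    do { (rD) = (rA) + (SIMM); CR[0] = (long)(rD); } while (0)

/* Branch if the last recorded result was greater than zero. */
#define BGT(label) \
    do { if (CR[0] > 0) goto label; } while (0)

/* Sums n + (n-1) + ... + 1 using the macros as a loop counter. */
static long sum_down(long n)
{
    long total = 0;

    if (n <= 0)
        return 0;
loop:
    total += n;
    ADDIC_C(n, n, -1);  /* n = n - 1; CR[0] = n */
    BGT(loop);          /* repeat while the new n is positive */
    return total;
}
```

Because `CR[n]` holds a signed value rather than individual LT/GT/EQ bits, every branch macro reduces to a single comparison against zero.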
#define BLE_CR_MINUS( bit, label )   BLE_CR( bit, label )
#define BLELR                        if ( CR[0] <= 0 ) return;
#define BLELR_PLUS                   BLELR
#define BLELR_MINUS                  BLELR
#define BLELR_CR( bit )              if ( CR[(bit)] <= 0 ) return;
#define BLELR_CR_PLUS( bit )         BLELR_CR( bit )
#define BLELR_CR_MINUS( bit )        BLELR_CR( bit )
#define BLR                          return;
#define BLT( label )                 if ( CR[0] < 0 ) goto label;
#define BLT_PLUS( label )            BLT( label )
#define BLT_MINUS( label )           BLT( label )
#define BLT_CR( bit, label )         if ( CR[(bit)] < 0 ) goto label;
#define BLT_CR_PLUS( bit, label )    BLT_CR( bit, label )
#define BLT_CR_MINUS( bit, label )   BLT_CR( bit, label )
#define BLTLR                        if ( CR[0] < 0 ) return;
#define BLTLR_PLUS                   BLTLR
#define BLTLR_MINUS                  BLTLR
#define BLTLR_CR( bit )              if ( CR[(bit)] < 0 ) return;
#define BLTLR_CR_PLUS( bit )         BLTLR_CR( bit )
#define BLTLR_CR_MINUS( bit )        BLTLR_CR( bit )
#define BNE( label )                 if ( CR[0] != 0 ) goto label;
#define BNE_PLUS( label )            BNE( label )
#define BNE_MINUS( label )           BNE( label )
#define BNE_CR( bit, label )         if ( CR[(bit)] != 0 ) goto label;
#define BNE_CR_PLUS( bit, label )    BNE_CR( bit, label )
#define BNE_CR_MINUS( bit, label )   BNE_CR( bit, label )
#define BNELR                        if ( CR[0] != 0 ) return;
#define BNELR_PLUS                   BNELR
#define BNELR_MINUS                  BNELR
#define BNELR_CR( bit )              if ( CR[(bit)] != 0 ) return;
#define BNELR_CR_PLUS( bit )         BNELR_CR( bit )
#define BNELR_CR_MINUS( bit )        BNELR_CR( bit )
#define BR( label )                  goto label;
#define CLRLWI( rA, rS, nbits )      (rA) = (rS) & ((1 << (32-nbits)) - 1);
#define CLRLWI_C( rA, rS, nbits )    (rA) = (rS) & ((1 << (32-nbits)) - 1); \
                                     CR[0] = (long)(rA);
#define CLRRWI( rA, rS, nbits )      (rA) = (rS) & ~((1 << nbits) - 1);
#define CLRRWI_C( rA, rS, nbits )    (rA) = (rS) & ~((1 << nbits) - 1); \
                                     CR[0] = (long)(rA);
#define CMPLW( rA, rB )              CR[0] = (((rA) ^ (rB)) & (1 << 31)) ? \
                                     ((rB) - (rA)) : ((rA) - (rB));
#define CMPLW_CR( bit, rA, rB )      CR[(bit)] = (((rA) ^ (rB)) & (1 << 31)) ? \
                                     ((rB) - (rA)) : ((rA) - (rB));
#define CMPLWI( rA, UIMM )           CR[0] = (((rA) ^ (UIMM)) & (1 << 31)) ? \
                                     ((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPLWI_CR( bit, rA, UIMM )   CR[(bit)] = (((rA) ^ (UIMM)) & (1 << 31)) ? \
                                     ((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPW( rA, rB )               CR[0] = (rA) - (rB);
#define CMPW_CR( bit, rA, rB )       CR[(bit)] = (rA) - (rB);
#define CMPWI( rA, SIMM )            CR[0] = (rA) - (SIMM);
#define CMPWI_CR( bit, rA, SIMM )    CR[(bit)] = (rA) - (SIMM);
#define DCBF( rA, rB )
#define DCBI( rA, rB )
#define DCBST( rA, rB )
#define DCBT( rA, rB )
#define DCBTST( rA, rB )
#define DCBZ( rA, rB ) \
    *(long *)(((rA)+(rB)) & ~CACHE_LINE_MASK) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+4) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+8) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+12) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+16) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+20) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+24) = 0; \
    *(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+28) = 0;
#define DECR( rD )                   --(rD);
#define DECR_C( rD )                 --(rD); CR[0] = (long)(rD);
#define DIVW( rD, rA, rB )           (rD) = (rA) / (rB);
#define DIVW_C( rD, rA, rB )         (rD) = (rA) / (rB); CR[0] = (long)(rD);
#define DIVWU( rD, rA, rB )          (rD) = (ulong)(rA) / (ulong)(rB);
#define DIVWU_C( rD, rA, rB )        (rD) = (ulong)(rA) / (ulong)(rB); \
                                     CR[0] = (long)(rD);
#define EQV( rA, rS, rB )            (rA) = ~((rS) ^ (rB));
#define EQV_C( rA, rS, rB )          (rA) = ~((rS) ^ (rB)); \
                                     CR[0] = (long)(rA);
#define FABS( frD, frB )             (frD) = ((frB) >= 0.0) ? (frB) : -(frB);
#define FADD( frD, frA, frB )        (frD) = (frA) + (frB);
#define FADDS( frD, frA, frB )       (frD) = (frA) + (frB);
#define FCMPO( bit, frA, frB ) \
{ \
    if ( (frA) < (frB) ) CR[(bit)] = -1; \
    else if ( (frA) > (frB) ) CR[(bit)] = 1; \
    else CR[(bit)] = 0; \
}



#define FCMPU( bit, frA, frB )         FCMPO( bit, frA, frB )
#define FCTIW( frD, frB )
#define FCTIWZ( frD, frB ) \
{ \
    union { \
        long i[2]; \
        double d; \
    } u; \
    u.i[0] = (long)(frB); \
    u.i[1] = 0; \
    (frD) = u.d; \
}
#define FDIV( frD, frA, frB )          (frD) = (frA) / (frB);
#define FDIVS( frD, frA, frB )         (frD) = (frA) / (frB);
#define FMADD( frD, frA, frC, frB )    (frD) = (frA) * (frC) + (frB);
#define FMADDS( frD, frA, frC, frB )   (frD) = (frA) * (frC) + (frB);
#define FMOV( frD, frB )               (frD) = (frB);
#define FMR( frD, frB )                (frD) = (frB);
#define FMUL( frD, frA, frB )          (frD) = (frA) * (frB);
#define FMULS( frD, frA, frB )         (frD) = (frA) * (frB);
#define FMSUB( frD, frA, frC, frB )    (frD) = (frA) * (frC) - (frB);
#define FMSUBS( frD, frA, frC, frB )   (frD) = (frA) * (frC) - (frB);
#define FNABS( frD, frB )              (frD) = ((frB) >= 0.0) ? -(frB) : (frB);
#define FNEG( frD, frB )               (frD) = -(frB);
#define FNMADD( frD, frA, frC, frB )   (frD) = -((frA) * (frC) + (frB));
#define FNMADDS( frD, frA, frC, frB )  (frD) = -((frA) * (frC) + (frB));
#define FNMSUB( frD, frA, frC, frB )   (frD) = -((frA) * (frC) - (frB));
#define FNMSUBS( frD, frA, frC, frB )  (frD) = -((frA) * (frC) - (frB));
#define FRES( frD, frB )
#define FRSP( frD, frB )               (frD) = (float)(frB);
#define FRSQRTE( frD, frB )
#define FSEL( frD, frA, frC, frB )     (frD) = ((frA) >= 0.0) ? (frC) : (frB);
#define FSUB( frD, frA, frB )          (frD) = (frA) - (frB);
#define FSUBS( frD, frA, frB )         (frD) = (frA) - (frB);
#define GOTO( label )                  BR( label )
#define INCR( rD )                     ++(rD);
#define INCR_C( rD )                   ++(rD); CR[0] = (long)(rD);
#define LA( rD, symbol, SIMM )     (rD) = (long)&(symbol);
#define LABEL( label )             label:
#define LBZ( rD, rA, d )           (rD) = *(uchar *)((rA) + (d));
#define LBZA( rD, symbol )         (rD) = *(uchar *)&(symbol);
#define LBZU( rD, rA, d )          (rD) = *(uchar *)((rA) += (d));
#define LBZUX( rD, rA, rB )        (rD) = *(uchar *)((rA) += (rB));
#define LBZX( rD, rA, rB )         (rD) = *(uchar *)((rA) + (rB));
#define LFD( frD, rA, d )          (frD) = *(double *)((rA) + (d));
#define LFDU( frD, rA, d )         (frD) = *(double *)((rA) += (d));
#define LFDUX( frD, rA, rB )       (frD) = *(double *)((rA) += (rB));
#define LFDX( frD, rA, rB )        (frD) = *(double *)((rA) + (rB));
#define LFS( frD, rA, d )          (frD) = *(float *)((rA) + (d));
#define LFSA( frD, symbol, rT )    (frD) = *(float *)&(symbol);
#define LFSU( frD, rA, d )         (frD) = *(float *)((rA) += (d));
#define LFSUX( frD, rA, rB )       (frD) = *(float *)((rA) += (rB));
#define LFSX( frD, rA, rB )        (frD) = *(float *)((rA) + (rB));
#define LHA( rD, rA, d )           (rD) = *(short *)((rA) + (d));
#define LHAA( rD, symbol )         (rD) = *(short *)&(symbol);
#define LHAU( rD, rA, d )          (rD) = *(short *)((rA) += (d));
#define LHAUX( rD, rA, rB )        (rD) = *(short *)((rA) += (rB));
#define LHAX( rD, rA, rB )         (rD) = *(short *)((rA) + (rB));
#define LHZ( rD, rA, d )           (rD) = *(ushort *)((rA) + (d));
#define LHZA( rD, symbol )         (rD) = *(ushort *)&(symbol);
#define LHZU( rD, rA, d )          (rD) = *(ushort *)((rA) += (d));
#define LHZUX( rD, rA, rB )        (rD) = *(ushort *)((rA) += (rB));
#define LHZX( rD, rA, rB )         (rD) = *(ushort *)((rA) + (rB));
#define LI( rD, SIMM )             (rD) = (SIMM);
#define LIS( rD, SIMM )            (rD) = ((SIMM) << 16);
#define LOAD_COUNT( rD )           CTR = (rD);
#define LWZ( rD, rA, d )           (rD) = *(long *)((rA) + (d));
#define LWZA( rD, symbol )         (rD) = *(long *)&(symbol);
#define LWZU( rD, rA, d )          (rD) = *(long *)((rA) += (d));
#define LWZUX( rD, rA, rB )        (rD) = *(long *)((rA) += (rB));
#define LWZX( rD, rA, rB )         (rD) = *(long *)((rA) + (rB));
#define MCRF( crfD, crfS )
#define MCRFS( crfD, crfS )
#define MFCR( rD )
#define MFCTR( rD )
#define MFLR( rD )
#define MFSPR( rD, SPR )
#define MOV( rA, rS )              (rA) = (rS);
#define MOV_C( rA, rS )            (rA) = (rS); CR[0] = (long)(rA);
#define MR( rA, rS )               (rA) = (rS);
#define MR_C( rA, rS )             (rA) = (rS); CR[0] = (long)(rA);
#define MTCR( rD )
#define MTCTR( rD )
#define MTFSFI( crfD, IMM )
#define MTLR( rD )
#define MTSPR( SPR, rS )
#define MULLI( rD, rA, SIMM )      (rD) = (rA) * (SIMM);
#define MULLW( rD, rA, rB )        (rD) = (rA) * (rB);
#define MULLW_C( rD, rA, rB )      (rD) = (rA) * (rB); CR[0] = (long)(rD);
#define NAND( rA, rS, rB )         (rA) = ~((rS) & (rB));
#define NAND_C( rA, rS, rB )       (rA) = ~((rS) & (rB)); CR[0] = (long)(rA);
#define NEG( rD, rA )              (rD) = -(rA);
#define NEG_C( rD, rA )            (rD) = -(rA); CR[0] = (long)(rD);
#define NOP
#define NOR( rA, rS, rB )          (rA) = ~((rS) | (rB));
#define NOR_C( rA, rS, rB )        (rA) = ~((rS) | (rB)); CR[0] = (long)(rA);
#define OR( rA, rS, rB )           (rA) = (rS) | (rB);
#define OR_C( rA, rS, rB )         (rA) = (rS) | (rB); CR[0] = (long)(rA);
#define ORC( rA, rS, rB )          (rA) = (rS) | ~(rB);
#define ORC_C( rA, rS, rB )        (rA) = (rS) | ~(rB); CR[0] = (long)(rA);
#define ORI( rA, rS, UIMM )        (rA) = (rS) | (UIMM);
#define ORIS( rA, rS, UIMM )       (rA) = (rS) | ((UIMM) << 16);
#define RETURN                     BLR

#define RLWIMI( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) &= ~mask; \
    (rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
}

#define RLWIMI_C( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) &= ~mask; \
    (rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
    CR[0] = (long)(rA); \
}

#define RLWINM( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
}

#define RLWINM_C( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
    CR[0] = (long)(rA); \
}
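
The rotate-and-mask behaviour of RLWINM above can be sketched as a standalone function. This is an illustrative sketch rather than the listing's own code: the function names and the fixed-width `uint32_t` types are assumptions, and it handles only the non-wrapping case `mb <= me`, as the macro's mask expression does. Unlike the macro, it avoids shifting a 32-bit value by 32 (which C leaves undefined) by special-casing a zero rotate and building the mask in 64 bits.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits without ever shifting by 32. */
static uint32_t rotl32(uint32_t x, unsigned sh)
{
    sh &= 31u;
    return sh ? (x << sh) | (x >> (32u - sh)) : x;
}

/* Ones in big-endian bit positions mb..me (bit 0 is the MSB), mb <= me. */
static uint32_t mask32(unsigned mb, unsigned me)
{
    uint64_t ones = ((uint64_t)1 << (me - mb + 1u)) - 1u;
    return (uint32_t)(ones << (31u - me));
}

/* Rotate left word immediate, then AND with mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & mask32(mb, me);
}
```

With these definitions, a right-justified extract of `n` bits starting at big-endian bit `b` is `rlwinm(rs, b + n, 32 - n, 31)`, which is the same argument pattern EXTRWI passes to RLWINM.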

#define RLWNM( rA, rS, rB, MB, ME )    RLWINM( rA, rS, (rB) & 0x1f, MB, ME )
#define RLWNM_C( rA, rS, rB, MB, ME )  RLWINM_C( rA, rS, (rB) & 0x1f, MB, ME )

#define EXTLWI( rA, rS, n, b )     RLWINM( rA, rS, (b), 0, (n)-1 )
#define EXTLWI_C( rA, rS, n, b )   RLWINM_C( rA, rS, (b), 0, (n)-1 )
#define EXTRWI( rA, rS, n, b )     RLWINM( rA, rS, (b)+(n), 32-(n), 31 )
#define EXTRWI_C( rA, rS, n, b )   RLWINM_C( rA, rS, (b)+(n), 32-(n), 31 )
#define INSLWI( rA, rS, n, b )     RLWIMI( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSLWI_C( rA, rS, n, b )   RLWIMI_C( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSRWI( rA, rS, n, b )     RLWIMI( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define INSRWI_C( rA, rS, n, b )   RLWIMI_C( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define ROTLW( rA, rS, rB )        RLWNM( rA, rS, rB, 0, 31 )
#define ROTLW_C( rA, rS, rB )      RLWNM_C( rA, rS, rB, 0, 31 )
#define ROTLWI( rA, rS, n )        RLWINM( rA, rS, (n), 0, 31 )
#define ROTLWI_C( rA, rS, n )      RLWINM_C( rA, rS, (n), 0, 31 )
#define ROTRWI( rA, rS, n )        RLWINM( rA, rS, 32-(n), 0, 31 )
#define ROTRWI_C( rA, rS, n )      RLWINM_C( rA, rS, 32-(n), 0, 31 )
#define SLW( rA, rS, rB )          (rA) = (rS) << (rB);
#define SLW_C( rA, rS, rB )        (rA) = (rS) << (rB); CR[0] = (long)(rA);
#define SLWI( rA, rS, SH )         (rA) = (rS) << (SH);
#define SLWI_C( rA, rS, SH )       (rA) = (rS) << (SH); CR[0] = (long)(rA);
#define SRAW( rA, rS, rB )         (rA) = (long)(rS) >> (rB);
#define SRAW_C( rA, rS, rB )       (rA) = (long)(rS) >> (rB); CR[0] = (long)(rA);
#define SRAWI( rA, rS, SH )        (rA) = (long)(rS) >> (SH);
#define SRAWI_C( rA, rS, SH )      (rA) = (long)(rS) >> (SH); CR[0] = (long)(rA);
#define SRW( rA, rS, rB )          (rA) = (ulong)(rS) >> (rB);
#define SRW_C( rA, rS, rB )        (rA) = (ulong)(rS) >> (rB); CR[0] = (long)(rA);
#define SRWI( rA, rS, SH )         (rA) = (ulong)(rS) >> (SH);
#define SRWI_C( rA, rS, SH )       (rA) = (ulong)(rS) >> (SH); CR[0] = (long)(rA);
#define STB( rS, rA, d )           *(char *)((rA) + (d)) = (rS);
#define STBU( rS, rA, d )          *(char *)((rA) += (d)) = (rS);
#define STBUX( rS, rA, rB )        *(char *)((rA) += (rB)) = (rS);
#define STBX( rS, rA, rB )         *(char *)((rA) + (rB)) = (rS);
#define STFD( frD, rA, d )         *(double *)((rA) + (d)) = (frD);
#define STFDU( frD, rA, d )        *(double *)((rA) += (d)) = (frD);
#define STFDUX( frD, rA, rB )      *(double *)((rA) += (rB)) = (frD);
#define STFDX( frD, rA, rB )       *(double *)((rA) + (rB)) = (frD);
#define STFS( frD, rA, d )         *(float *)((rA) + (d)) = (frD);
#define STFSU( frD, rA, d )        *(float *)((rA) += (d)) = (frD);
#define STFSUX( frD, rA, rB )      *(float *)((rA) += (rB)) = (frD);
#define STFSX( frD, rA, rB )       *(float *)((rA) + (rB)) = (frD);
#define STH( rS, rA, d )           *(short *)((rA) + (d)) = (rS);
#define STHU( rS, rA, d )          *(short *)((rA) += (d)) = (rS);
#define STHUX( rS, rA, rB )        *(short *)((rA) += (rB)) = (rS);
#define STHX( rS, rA, rB )         *(short *)((rA) + (rB)) = (rS);
#define STW( rS, rA, d )           *(long *)((rA) + (d)) = (rS);
#define STWU( rS, rA, d )          *(long *)((rA) += (d)) = (rS);
#define STWUX( rS, rA, rB )        *(long *)((rA) += (rB)) = (rS);
#define STWX( rS, rA, rB )         *(long *)((rA) + (rB)) = (rS);
#define SUB( rD, rA, rB )          (rD) = (rA) - (rB);
#define SUB_C( rD, rA, rB )        (rD) = (rA) - (rB); CR[0] = (long)(rD);
#define SUBFIC( rD, rA, SIMM )     (rD) = (SIMM) - (rA);
#define SUBI( rD, rA, SIMM )       (rD) = (rA) - (SIMM);
#define SUBIC_C( rD, rA, SIMM )    (rD) = (rA) - (SIMM); CR[0] = (long)(rD);
#define SUBIS( rD, rA, SIMM )      (rD) = (rA) - ((SIMM) << 16);
#define TEST_COUNT( label )        if ( --CTR ) goto label;
#define XOR( rA, rS, rB )          (rA) = (rS) ^ (rB);
#define XOR_C( rA, rS, rB )        (rA) = (rS) ^ (rB); CR[0] = (long)(rA);
#define XORI( rA, rS, UIMM )       (rA) = (rS) ^ (UIMM);
#define XORIS( rA, rS, UIMM )      (rA) = (rS) ^ ((UIMM) << 16);



#if defined( BUILD_MAX )

/*
 * VMX instructions
 */

#define BR_VMX_ALL_TRUE( label )     if ( CR[6] & 0x8 ) goto label;
#define BR_VMX_ALL_FALSE( label )    if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_NONE_TRUE( label )    if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_SOME_FALSE( label )   if ( !(CR[6] & 0x8) ) goto label;
#define BR_VMX_SOME_TRUE( label )    if ( !(CR[6] & 0x2) ) goto label;

#define DSS( STRM )
#define DSSALL
#define DST( rA, rB, STRM )
#define DSTT( rA, rB, STRM )
#define DSTST( rA, rB, STRM )
#define DSTSTT( rA, rB, STRM )

#if defined( COMPILE_NON_ALIGNED )
#define VMX_ADDR_MASK 0
#else
#define VMX_ADDR_MASK 15
#endif



#if defined( COMPILE_LVX_CHARS )

#define LVX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[C_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
    (vT).c[C_INDEX_MUNGE( i + 2 )] = addr[2]; \
    (vT).c[C_INDEX_MUNGE( i + 3 )] = addr[3]; \
}

#elif defined( COMPILE_LVX_SHORTS )

#define LVX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[S_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
    (vT).s[S_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#else

#define LVX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[L_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    (vT).l[L_INDEX_MUNGE( i )] = addr[0]; \
}

#endif



#if defined( COMPILE_STVX_CHARS )

#define STVX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        addr[i] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
    addr[2] = (vS).c[C_INDEX_MUNGE( i + 2 )]; \
    addr[3] = (vS).c[C_INDEX_MUNGE( i + 3 )]; \
}

#elif defined( COMPILE_STVX_SHORTS )

#define STVX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        addr[i] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
    addr[1] = (vS).s[S_INDEX_MUNGE( i + 1 )]; \
}

#else

#define STVX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        addr[i] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    addr[0] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#endif

#define LVSL_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = ((ulong)(rA) + (ulong)(rB)) & VMX_ADDR_MASK; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}

#define LVSR_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = 16 - (((ulong)(rA) + (ulong)(rB)) & VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}

#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB )   LVSR_BE( vT, rA, rB );
#define LVSR( vT, rA, rB )   LVSL_BE( vT, rA, rB );
#else
#define LVSL( vT, rA, rB )   LVSL_BE( vT, rA, rB );
#define LVSR( vT, rA, rB )   LVSR_BE( vT, rA, rB );
#endif
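
To show what the control vector built by LVSL_BE is typically used for, the sketch below generates the same byte sequence `j, j+1, ..., j+15` and feeds it to a simple permute step that selects 16 bytes out of two adjacent aligned blocks, which is the usual way to assemble an unaligned load. The `permute` helper and all names here are assumptions for illustration; no permute macro appears in this part of the listing.

```c
#include <stdint.h>

/* Builds an lvsl-style control vector: bytes j, j+1, ..., j+15, j = addr & 15. */
static void lvsl_control(uint8_t ctl[16], uintptr_t addr)
{
    unsigned j = (unsigned)(addr & 15u);
    for (unsigned i = 0; i < 16; i++)
        ctl[i] = (uint8_t)(j + i);
}

/* Each control byte indexes the 32-byte concatenation of hi followed by lo. */
static void permute(uint8_t out[16], const uint8_t hi[16],
                    const uint8_t lo[16], const uint8_t ctl[16])
{
    for (unsigned i = 0; i < 16; i++) {
        unsigned k = ctl[i] & 31u;
        out[i] = (k < 16u) ? hi[k] : lo[k - 16u];
    }
}
```

Given two consecutive aligned 16-byte blocks and an address offset of 5 into the first, the permute output holds bytes 5 through 20 of the combined 32 bytes.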



#define LVXL( vT, rA, rB ) LVX ( vT, rA, rB ) 

#define STVXL( vS, rA, rB ) STVX( vS, rA, rB ) 

#define VADDFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a + b; \
        (vT).f[i] = c; \
    } \
}

#define VADDSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] + (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}
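
The saturating add performed by VADDSBS can be sketched as ordinary functions. The names and the fixed-width `int8_t` type below are assumptions for illustration; the listing itself works on the `c` member of its vector union.

```c
#include <stdint.h>

/* Adds two signed bytes in a wider type and clamps to [-128, 127]. */
static int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum < INT8_MIN) return INT8_MIN;
    if (sum > INT8_MAX) return INT8_MAX;
    return (int8_t)sum;
}

/* Element-wise saturating add over the 16 signed bytes of one vector. */
static void vaddsbs(int8_t out[16], const int8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = add_sat_s8(a[i], b[i]);
}
```

Widening before the addition is what makes the range check possible: the 8-bit sum itself cannot represent the overflowed value.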



#define VADDSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] + (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}

#define VADDSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] + (vB).l[i]; \
        if ( ((vA).l[i] > 0) && ((vB).l[i] > 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] < 0) && (itemp > 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}

#define VADDUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] + (vB).uc[i]; \
}

#define VADDUBS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (ulong)(vA).uc[i] + (ulong)(vB).uc[i]; \
        if ( itemp > 255 ) (vT).uc[i] = 255; \
        else (vT).uc[i] = (uchar)itemp; \
    } \
}

#define VADDUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] + (vB).us[i]; \
}

#define VADDUHS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (ulong)(vA).us[i] + (ulong)(vB).us[i]; \
        if ( itemp > 65535 ) (vT).us[i] = 65535; \
        else (vT).us[i] = (ushort)itemp; \
    } \
}

#define VADDUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] + (vB).ul[i]; \
}

#define VADDUWS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).ul[i] + (vB).ul[i]; \
        if ( itemp < (vA).ul[i] ) (vT).ul[i] = (ulong)0xffffffff; \
        else (vT).ul[i] = itemp; \
    } \
}
#define VAND( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & (vB).ul[i]; \
}

#define VANDC( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & ~(vB).ul[i]; \
}

#define VCMPEQFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPEQFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}
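
The record-form compare above does two things at once: it writes a per-element mask, and it summarizes the masks into `CR[6]` (0x8 when every element matched, 0x2 when none did, 0 otherwise). The sketch below isolates that summary logic as a function that returns the value instead of storing it. The function name and the `uint32_t` mask type are assumptions for illustration.

```c
#include <stdint.h>

/* Writes an all-ones or all-zeros mask per lane and returns the CR6-style
 * summary: 0x8 if all four lanes are equal, 0x2 if none are, else 0. */
static unsigned cmpeq_summary(const float a[4], const float b[4],
                              uint32_t mask[4])
{
    uint32_t all = 0xffffffffu;  /* stays non-zero only if every lane matched */
    uint32_t any = 0;            /* becomes non-zero if any lane matched */

    for (int i = 0; i < 4; i++) {
        mask[i] = (a[i] == b[i]) ? 0xffffffffu : 0u;
        all &= mask[i];
        any |= mask[i];
    }
    if (all) return 0x8;
    if (!any) return 0x2;
    return 0;
}
```

The BR_VMX_ALL_TRUE and BR_VMX_NONE_TRUE macros defined earlier then only need to test bit 0x8 or bit 0x2 of this summary.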

tfdefine VCMPEQUB( vT, vA, vB ) \ 
{ \ 

ulong x; \ 

for ( i = 0; i < 16; i++ ) \ 

<vT).uc[i] = ( (vA).ucli] == (vB).uc[i] ) ? Oxff : 0; \ 

#define VCMPEQUB_C( vT, vA, vB ) \ 

ulong i; \ 
uchar t, f; \ 
t = Oxff; \ 
f = 0; \ 

for ( i = 0; i < 16; i++ ) { \ 

(vT).uc[i] = ( (vA).uc[i] == (vB).ucli] ) ? Oxff : 0; \ 
t &= (vT) .ucti] ; \ 
f |= (vT) .uc[i] ; \ 

} \ 

if ( t ) CR[6] = 0x8; \ 

else if ( !f ) CR[6] = 0x2; \ 

else CR[63 = 0; \ 

#define VCMPEQUH ( vT, vA, vB ) \ 
{ \ 

ulong l; \ 

for ( i = 0; i < 8; i++ ) \ 

(vT).us[i3 = ( (vA).us[i] == (vB).us[i] ) ? Oxffff : 0; \ 

tfdefine VCMPEQUH_C( vT, vA, vB ) \ 
{ \ 

ulong i; \ 
ushort t, f; \ 
t = Oxffff; \ 
f = 0; \ 

for ( i = 0; i < 8; i++ ) { \ 
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(vT).us[i] = ( (vA).us[i] == (vB).us[i] ) ? Oxffff : 0; \ 
t &= (vT) -us[i] ; \ 
f |= (vT) .us[i] ; \ 

} \ 

if ( t ) CR[6] = 0x8; \ 

else if ( If ) CR[6] = 0x2; \ 

else CR[6] = 0; \ 

tfdefine VCMPEQUW( vT, vA, vB ) \ 
{ \ 

ulong i; \ 

for ( i = 0; i < 4; i++ ) \ 

(vT).ul[i] = ( (vA).ul[i] == (vB).uim ) ? Oxffffffff : 0; \ 

tfdefine VCMPEQUW C( vT, vA, vB ) \ 
{ \ 

ulong i; \ 
ulong t, f; \ 
t = Oxffffffff; \ 
f = 0; \ 

for ( i = 0; i < 4; i++ ) { \ 

(vT)'.ul[i] "- ( (vA).ulti] == (vB).ul[i] ) ? Oxffffffff : 0; \ 
t (vT) .ul[i] ; \ 
f |= (vT) .ul[i] ; \ 

} \ 

if ( t ) CRI6] = 0x8; \ 

else if ( !f ) CR[6] = 0x2; \ 

else CR[6] = 0; \ 

#define VCMPGEFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] >= (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPGEFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] >= (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] > (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] > (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
}

#define VCMPGTSB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
}

#define VCMPGTSH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTSW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
}

#define VCMPGTUB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}
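
The record-form compare macros above (the `_C` variants) write an all-ones or all-zeros mask per element and then summarize the result in CR[6]: 0x8 when every element compared true, 0x2 when none did, and 0 otherwise. The following is a minimal sketch of that summary logic as a plain function. The name `cmp_eq_bytes_record` and the use of `uint8_t` arrays in place of the listing's vector register type are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the listing): byte-wise equality compare
 * that returns the value the record-form macros store in CR[6]. */
static unsigned cmp_eq_bytes_record(const uint8_t a[16], const uint8_t b[16],
                                    uint8_t out[16])
{
    uint8_t all = 0xff;  /* stays 0xff only if every element compared true */
    uint8_t any = 0;     /* becomes non-zero if any element compared true  */
    int i;

    for (i = 0; i < 16; i++) {
        out[i] = (a[i] == b[i]) ? 0xff : 0;
        all &= out[i];
        any |= out[i];
    }
    if (all)  return 0x8;  /* all elements true  */
    if (!any) return 0x2;  /* all elements false */
    return 0;              /* mixed result       */
}
```

A caller can branch on the returned value instead of scanning the mask, which is the purpose the CR[6] update serves in the macros.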

#define VCMPGTUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
}

#define VCMPGTUH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTUW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCFSX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 - ((UIMM) & 0x1f)) << 23; \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).l[i]) / fj; \
}

#define VCFUX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 - ((UIMM) & 0x1f)) << 23; \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).ul[i]) / fj; \
}
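
The conversion macros above obtain their scale factor by writing a biased exponent into bits 23 through 30 of a 32-bit word and reinterpreting that word as a float. A minimal sketch of the trick follows, assuming IEEE 754 single precision and a 32-bit `float`. The name `pow2_float` is hypothetical, and `memcpy` is swapped in for the listing's pointer cast so the example does not depend on aliasing behavior.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the listing): returns 2^n as a float,
 * valid for -126 <= n <= 127, by storing the biased exponent (127 + n)
 * in bits 23..30 with a zero sign and zero fraction. */
static float pow2_float(int n)
{
    uint32_t bits = (uint32_t)(127 + n) << 23;
    float f;

    memcpy(&f, &bits, sizeof f);  /* bit-for-bit copy into the float */
    return f;
}
```

With this helper, `pow2_float(31)` gives the 2^31 bound that the saturating conversions compare against, and `pow2_float(-k)` gives the reciprocal power of two.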

#define VCTSXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i; \
    long l; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= -max ) l = 0x80000000; \
        else if ( g >= max ) l = 0x7fffffff; \
        else l = (long)f << ((UIMM) & 0x1f); \
        (vT).l[i] = l; \
    } \
}

#define VCTUXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i, ul; \
    i = (127 + 32) << 23; \
    max = *(float *)&i; \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= 0 ) ul = 0; \
        else if ( g >= max ) ul = 0xffffffff; \
        else ul = (ulong)f << ((UIMM) & 0x1f); \
        (vT).ul[i] = ul; \
    } \
}
#define VEXPTEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = exp( 0.693147180559945 * (vB).f[i] ); \
}

#define VLOGEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.442695040888963 * log( (vB).f[i] ); \
}

#define VMADDFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b + d; \
        (vT).f[i] = d; \
    } \
}



#define VMAXFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] >= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMAXSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] >= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

#define VMAXSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = ((vA).s[i] >= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMAXSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = ((vA).l[i] >= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMAXUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] >= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMAXUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ((vA).us[i] >= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMAXUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ((vA).ul[i] >= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}

#define VMHADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}

#define VMHRADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a += 0x00004000; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}



#define VMINFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] <= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMINSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] <= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

#define VMINSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = ((vA).s[i] <= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMINSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = ((vA).l[i] <= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMINUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] <= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMINUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ((vA).us[i] <= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMINUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ((vA).ul[i] <= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}



#define VMLADDUHM( vD, vA, vB, vC ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).us[i]; \
        b = (ulong)(vB).us[i]; \
        c = (ulong)(vC).us[i]; \
        c += (a * b); \
        (vD).us[i] = (ushort)c; \
    } \
}

#define VMR( vD, vS ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vD).ul[i] = (vS).ul[i]; \
}

#define VMRGHB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[i]; \
        v.uc[(j+1)] = (vB).uc[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGHH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[i]; \
        v.us[(j+1)] = (vB).us[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}
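
The merge-high macros above interleave the first half of the elements of vA with the first half of vB, building the result in a temporary so that vT may be the same register as a source. A minimal sketch of the halfword case follows. The function name `merge_high_u16` and the plain `uint16_t` arrays are assumptions for illustration and stand in for the listing's register union.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the listing): interleave elements 0..3
 * of a and b, as the VMRGHH_BE loop does with big-endian element numbering. */
static void merge_high_u16(const uint16_t a[8], const uint16_t b[8],
                           uint16_t out[8])
{
    uint16_t tmp[8];  /* temporary, so out may alias a or b */
    int i;

    for (i = 0; i < 4; i++) {
        tmp[2 * i]     = a[i];
        tmp[2 * i + 1] = b[i];
    }
    for (i = 0; i < 8; i++)
        out[i] = tmp[i];
}
```

The merge-low variants follow the same pattern but read elements 4..7 of each source.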

#define VMRGHW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[i]; \
        v.ul[(j+1)] = (vB).ul[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[(8+i)]; \
        v.uc[(j+1)] = (vB).uc[(8+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[(4+i)]; \
        v.us[(j+1)] = (vB).us[(4+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[(2+i)]; \
        v.ul[(j+1)] = (vB).ul[(2+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )
#define VMRGHB( vT, vA, vB )  VMRGLB_BE( vT, vB, vA )
#define VMRGHH( vT, vA, vB )  VMRGLH_BE( vT, vB, vA )
#define VMRGHW( vT, vA, vB )  VMRGLW_BE( vT, vB, vA )
#define VMRGLB( vT, vA, vB )  VMRGHB_BE( vT, vB, vA )
#define VMRGLH( vT, vA, vB )  VMRGHH_BE( vT, vB, vA )
#define VMRGLW( vT, vA, vB )  VMRGHW_BE( vT, vB, vA )
#else
#define VMRGHB( vT, vA, vB )  VMRGHB_BE( vT, vA, vB )
#define VMRGHH( vT, vA, vB )  VMRGHH_BE( vT, vA, vB )
#define VMRGHW( vT, vA, vB )  VMRGHW_BE( vT, vA, vB )
#define VMRGLB( vT, vA, vB )  VMRGLB_BE( vT, vA, vB )
#define VMRGLH( vT, vA, vB )  VMRGLH_BE( vT, vA, vB )
#define VMRGLW( vT, vA, vB )  VMRGLW_BE( vT, vA, vB )
#endif



#define VMSUMMBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, c; \
    ulong b; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (long)(vA).c[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}

#define VMSUMSHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; \
            b = (long)(vB).s[2*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}

#define VMSUMSHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; \
            b = (long)(vB).s[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 2147483647.0 ) c = 2147483647.0; \
        else if ( c <= -2147483648.0 ) c = -2147483648.0; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMSUMUBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (ulong)(vA).uc[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; \
            b = (ulong)(vB).us[2*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; \
            b = (ulong)(vB).us[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 4294967295.0 ) c = 4294967295.0; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULESB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i]; \
        b = (long)(vB).c[2*i]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULESH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i]; \
        b = (long)(vB).s[2*i]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULEUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i]; \
        b = (ulong)(vB).uc[2*i]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULEUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i]; \
        b = (ulong)(vB).us[2*i]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULOSB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i+1]; \
        b = (long)(vB).c[2*i+1]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULOSH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i+1]; \
        b = (long)(vB).s[2*i+1]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULOUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i+1]; \
        b = (ulong)(vB).uc[2*i+1]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULOUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i+1]; \
        b = (ulong)(vB).us[2*i+1]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VNMSUBFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b - d; \
        (vT).f[i] = d; \
    } \
}

#define VNOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ~((vA).ul[i] | (vB).ul[i]); \
}

#define VNOT( vT, vA ) VNOR( vT, vA, vA )

#define VOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] | (vB).ul[i]; \
}

#define VPERM_BE( vT, vA, vB, vC ) \
{ \
    VMX_reg v; \
    ulong field, i; \
    for ( i = 0; i < 16; i++ ) { \
        field = (vC).uc[i]; \
        v.uc[i] = ( field < 16 ) ? (vA).uc[field] : (vB).uc[field - 16]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUHUM_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        v.uc[i] = (vA).uc[(j)]; \
        v.uc[i+8] = (vB).uc[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUHUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        v.uc[i] = (vA).uc[(j ^ 1)] ? (uchar)255 : (vA).uc[(j)]; \
        v.uc[i+8] = (vB).uc[(j ^ 1)] ? (uchar)255 : (vB).uc[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSHUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).s[i] <= 0 ) v.uc[i] = 0; \
        else if ( (vA).s[i] >= 255 ) v.uc[i] = 255; \
        else v.uc[i] = (vA).uc[j]; \
        if ( (vB).s[i] <= 0 ) v.uc[i+8] = 0; \
        else if ( (vB).s[i] >= 255 ) v.uc[i+8] = 255; \
        else v.uc[i+8] = (vB).uc[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSHSS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).s[i] <= -128 ) v.c[i] = -128; \
        else if ( (vA).s[i] >= 127 ) v.c[i] = 127; \
        else v.c[i] = (vA).c[j]; \
        if ( (vB).s[i] <= -128 ) v.c[i+8] = -128; \
        else if ( (vB).s[i] >= 127 ) v.c[i+8] = 127; \
        else v.c[i+8] = (vB).c[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}
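
The saturating pack macros above clamp each wide source element to the range of the narrow destination before storing it, with vA filling the first half of the result and vB the second half. The following is a minimal sketch of the signed halfword-to-byte case. The names `sat_s16_to_s8` and `pack_s16_to_s8_sat` are hypothetical, and the sketch clamps by value rather than selecting a byte through the `base` index the macros use for endianness.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 16-bit signed value to the int8_t range. */
static int8_t sat_s16_to_s8(int16_t x)
{
    if (x < -128) return -128;
    if (x > 127)  return 127;
    return (int8_t)x;
}

/* Hypothetical helper: pack eight elements of a into out[0..7] and eight
 * elements of b into out[8..15], saturating each one. */
static void pack_s16_to_s8_sat(const int16_t a[8], const int16_t b[8],
                               int8_t out[16])
{
    int i;

    for (i = 0; i < 8; i++) {
        out[i]     = sat_s16_to_s8(a[i]);
        out[i + 8] = sat_s16_to_s8(b[i]);
    }
}
```

Clamping by value makes the result independent of byte order, which is why this sketch needs no counterpart to the `base` argument.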

#define VPKUWUM_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        v.us[i] = (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        v.us[i] = (vA).us[(j ^ 1)] ? (ushort)65535 : (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j ^ 1)] ? (ushort)65535 : (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).l[i] <= 0 ) v.us[i] = 0; \
        else if ( (vA).l[i] >= 65535 ) v.us[i] = 65535; \
        else v.us[i] = (vA).us[j]; \
        if ( (vB).l[i] <= 0 ) v.us[i+4] = 0; \
        else if ( (vB).l[i] >= 65535 ) v.us[i+4] = 65535; \
        else v.us[i+4] = (vB).us[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSWSS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).l[i] <= -32768 ) v.s[i] = -32768; \
        else if ( (vA).l[i] >= 32767 ) v.s[i] = 32767; \
        else v.s[i] = (vA).s[j]; \
        if ( (vB).l[i] <= -32768 ) v.s[i+4] = -32768; \
        else if ( (vB).l[i] >= 32767 ) v.s[i+4] = 32767; \
        else v.s[i+4] = (vB).s[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )  VPERM_BE( vT, vB, vA, vC );
#define VPKUHUM( vT, vA, vB )    VPKUHUM_BE( vT, vB, vA, 0 )
#define VPKUHUS( vT, vA, vB )    VPKUHUS_BE( vT, vB, vA, 0 )
#define VPKSHUS( vT, vA, vB )    VPKSHUS_BE( vT, vB, vA, 0 )
#define VPKSHSS( vT, vA, vB )    VPKSHSS_BE( vT, vB, vA, 0 )
#define VPKUWUM( vT, vA, vB )    VPKUWUM_BE( vT, vB, vA, 0 )
#define VPKUWUS( vT, vA, vB )    VPKUWUS_BE( vT, vB, vA, 0 )
#define VPKSWUS( vT, vA, vB )    VPKSWUS_BE( vT, vB, vA, 0 )
#define VPKSWSS( vT, vA, vB )    VPKSWSS_BE( vT, vB, vA, 0 )
#else
#define VPERM( vT, vA, vB, vC )  VPERM_BE( vT, vA, vB, vC )
#define VPKUHUM( vT, vA, vB )    VPKUHUM_BE( vT, vA, vB, 1 )
#define VPKUHUS( vT, vA, vB )    VPKUHUS_BE( vT, vA, vB, 1 )
#define VPKSHUS( vT, vA, vB )    VPKSHUS_BE( vT, vA, vB, 1 )
#define VPKSHSS( vT, vA, vB )    VPKSHSS_BE( vT, vA, vB, 1 )
#define VPKUWUM( vT, vA, vB )    VPKUWUM_BE( vT, vA, vB, 1 )
#define VPKUWUS( vT, vA, vB )    VPKUWUS_BE( vT, vA, vB, 1 )
#define VPKSWUS( vT, vA, vB )    VPKSWUS_BE( vT, vA, vB, 1 )
#define VPKSWSS( vT, vA, vB )    VPKSWSS_BE( vT, vA, vB, 1 )
#endif


#define VREFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / (vB).f[i]; \
}

#define VRFIM( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r > f ) --r; \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}

#define VRFIN( vT, vB ) \
{ \
    float f, r, s; \
    ulong i; \
    long lr; \
    for ( i = 0; i < 4; i++ ) { \
        s = f = (vB).f[i]; \
        if ( f < 0.0 ) f = -f; \
        r = f + 0.5; \
        if ( r != f ) { \
            lr = (long)r; \
            f = (float)lr; \
            if ( f == r ) f = (float)(lr & ~1); \
        } \
        if ( s < 0.0 ) f = -f; \
        (vT).f[i] = f; \
    } \
}

#define VRFIP( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r < f ) ++r; \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}
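
The round-toward-minus-infinity and round-toward-plus-infinity macros above work by truncating through an integer conversion and then stepping the result by one when truncation moved it the wrong way. Values at or beyond 2^31 in magnitude are passed through unchanged, since they are already integral and would not fit the integer type. A minimal sketch of both directions follows. The function names are hypothetical, and a `long` of at least 32 bits is assumed.

```c
/* Hypothetical helper: floor of f using truncation plus a correction step.
 * Inputs outside [-2^31, 2^31), and NaN, are returned unchanged. */
static float floor_via_trunc(float f)
{
    const float max = 2147483648.0f;   /* 2^31 */

    if (f >= -max && f < max) {
        float r = (float)(long)f;      /* conversion truncates toward zero */
        if (r > f) r -= 1.0f;          /* negative non-integers: step down */
        f = r;
    }
    return f;
}

/* Hypothetical helper: ceiling of f, the mirror image of the above. */
static float ceil_via_trunc(float f)
{
    const float max = 2147483648.0f;   /* 2^31 */

    if (f >= -max && f < max) {
        float r = (float)(long)f;
        if (r < f) r += 1.0f;          /* positive non-integers: step up */
        f = r;
    }
    return f;
}
```

The range guard matters because converting an out-of-range float to an integer type is undefined in C.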

#define VRFIZ( vT, vB ) \
{ \
    float f, max; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) \
            f = (float)((long)f); \
        (vT).f[i] = f; \
    } \
}

#define VRLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = ((vA).uc[i] << sh) | ((vA).uc[i] >> (8-sh)); \
    } \
}

#define VRLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = ((vA).us[i] << sh) | ((vA).us[i] >> (16-sh)); \
    } \
}

#define VRSQRTEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / sqrt( (vB).f[i] ); \
}

#define VRLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = ((vA).ul[i] << sh) | ((vA).ul[i] >> (32-sh)); \
    } \
}

#define VSEL( vT, vA, vB, vC ) \
{ \
    ulong atemp, btemp, i; \
    for ( i = 0; i < 4; i++ ) { \
        atemp = (vA).ul[i] & ~(vC).ul[i]; \
        btemp = (vB).ul[i] & (vC).ul[i]; \
        (vT).ul[i] = atemp | btemp; \
    } \
}

#define VSL( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[0] = ((vA).ul[0] << sh) | ((vA).ul[1] >> (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] << sh) | ((vA).ul[2] >> (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] << sh) | ((vA).ul[3] >> (32-sh)); \
    (vT).ul[3] = (vA).ul[3] << sh; \
}

#define VSLDOI( vT, vA, vB, UIMM ) \
{ \
    VMX_reg v; \
    ulong i, j, sh; \
    sh = (UIMM) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        v.uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        v.uc[j] = (vB).uc[j-i]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VSLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] << sh; \
    } \
}

#define VSLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] << sh; \
    } \
}
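
The VSLDOI macro above treats vA followed by vB as one 32-byte string and copies out the 16 bytes that start at byte offset `sh`. A minimal sketch of that selection follows. The function name `shift_left_double_bytes` and the `uint8_t` arrays are assumptions for illustration, and a single index test replaces the macro's two copy loops.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the listing): out = bytes sh..sh+15 of
 * the 32-byte concatenation a:b, with sh reduced to the range 0..15. */
static void shift_left_double_bytes(const uint8_t a[16], const uint8_t b[16],
                                    unsigned sh, uint8_t out[16])
{
    uint8_t tmp[16];  /* temporary, so out may alias a or b */
    unsigned i;

    sh &= 0xf;
    for (i = 0; i < 16; i++)
        tmp[i] = (i + sh < 16) ? a[i + sh] : b[i + sh - 16];
    for (i = 0; i < 16; i++)
        out[i] = tmp[i];
}
```

With `sh` equal to 0 the result is a copy of `a`; larger values pull progressively more leading bytes from `b` into the tail of the result.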

#define VSLO( vT, vA, vB ) \
{ \
    ulong i, j, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        (vT).uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        (vT).uc[j] = 0; \
}

#define VSLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] << sh; \
    } \
}



#define VSR( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[3] = ((vA).ul[3] >> sh) | ((vA).ul[2] << (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] >> sh) | ((vA).ul[1] << (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] >> sh) | ((vA).ul[0] << (32-sh)); \
    (vT).ul[0] = (vA).ul[0] >> sh; \
}

#define VSRAB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).c[i] = (vA).c[i] >> sh; \
    } \
}

#define VSRAH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).s[i] = (vA).s[i] >> sh; \
    } \
}

#define VSRAW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).l[i] = (vA).l[i] >> sh; \
    } \
}

#define VSRB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] >> sh; \
    } \
}

#define VSRH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] >> sh; \
    } \
}

#define VSRO( vT, vA, vB ) \
{ \
    long i, j, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 15; i >= sh; i-- ) \
        (vT).uc[i] = (vA).uc[i-sh]; \
    for ( j = i; j >= 0; j-- ) \
        (vT).uc[j] = 0; \
}

#define VSRW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] >> sh; \
    } \
}

#define VSPLTB( vT, vB, UIMM ) \
{ \
    uchar c; \
    ulong i; \
    c = (vB).uc[C_INDEX_MUNGE( UIMM ) & 0xf]; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = c; \
}

#define VSPLTH( vT, vB, UIMM ) \
{ \
    ushort s; \
    ulong i; \
    s = (vB).us[S_INDEX_MUNGE( UIMM ) & 0x7]; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = s; \
}

#define VSPLTW( vT, vB, UIMM ) \
{ \
    ulong i, l; \
    l = (vB).ul[L_INDEX_MUNGE( UIMM ) & 0x3]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = l; \
}

#define VSPLTISB( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = (char)(SIMM); \
}

#define VSPLTISH( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(SIMM); \
}

#define VSPLTISW( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(SIMM); \
}

#define VSUBFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a - b; \
        (vT).f[i] = c; \
    } \
}

#define VSUBSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] - (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}
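
The saturating subtract macros above compute the difference in a wider type and clamp it to the element range for signed elements, while the unsigned forms compare the operands first so the difference never wraps. A minimal sketch of both byte-sized cases follows. The names `sub_sat_s8` and `sub_sat_u8` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: signed saturating subtract on 8-bit values.
 * The difference is formed in a wider type, then clamped. */
static int8_t sub_sat_s8(int8_t a, int8_t b)
{
    long d = (long)a - (long)b;

    if (d < -128) return -128;
    if (d > 127)  return 127;
    return (int8_t)d;
}

/* Hypothetical helper: unsigned saturating subtract on 8-bit values.
 * Comparing first avoids wraparound below zero. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (a <= b) ? 0 : (uint8_t)(a - b);
}
```

The word-sized signed case cannot widen on a 32-bit `long`, which is why the VSUBSWS macro detects overflow from the operand and result signs instead.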

#define VSUBSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] - (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}

#define VSUBSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] - (vB).l[i]; \
        if ( ((vA).l[i] >= 0) && ((vB).l[i] < 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] > 0) && (itemp > 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}

#define VSUBUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
}

#define VSUBUBS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) { \
        if ( (vA).uc[i] <= (vB).uc[i] ) (vT).uc[i] = 0; \
        else (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
    } \
}
#define VSUBUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] - (vB).us[i]; \
}

#define VSUBUHS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).us[i] <= (vB).us[i] ) (vT).us[i] = 0; \
        else (vT).us[i] = (vA).us[i] - (vB).us[i]; \
    } \
}

#define VSUBUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
}

#define VSUBUWS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).ul[i] <= (vB).ul[i] ) (vT).ul[i] = 0; \
        else (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
    } \
}

#define VSUMSWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum; \
    sum = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 4; i++ ) \
        sum += (double)(vA).l[i]; \
    if ( sum > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    else if ( sum < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum; \
}

#define VSUM2SWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum1, sum2; \
    sum1 = (double)(vB).l[L_INDEX_MUNGE( 1 )]; \
    sum2 = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 2; i++ ) { \
        sum1 += (double)(vA).l[L_INDEX_MUNGE( i )]; \
        sum2 += (double)(vA).l[L_INDEX_MUNGE( i+2 )]; \
    } \
    if ( sum1 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x7fffffff; \
    else if ( sum1 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 1 )] = (long)sum1; \
    if ( sum2 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    else if ( sum2 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum2; \
}

#define VSUM4SBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 4; j++ ) \
            sum += (double)(vA).c[4*i+j]; \
        if ( sum > (double)(0x7fffffff) ) \
            (vT).l[i] = 0x7fffffff; \
        else if ( sum < -2147483648.0 ) \
            (vT).l[i] = 0x80000000; \
        else \
            (vT).l[i] = (long)sum; \
    } \
}

#define VSUM4SHS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 2; j++ ) \
            sum += (double)(vA).s[2*i+j]; \
        if ( sum > (double)(0x7fffffff) ) \
            (vT).l[i] = 0x7fffffff; \
        else if ( sum < -2147483648.0 ) \
            (vT).l[i] = 0x80000000; \
        else \
            (vT).l[i] = (long)sum; \
    } \
}

#define VSUM4UBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).ul[i]; \
        for ( j = 0; j < 4; j++ ) \
            sum += (double)(vA).uc[4*i+j]; \
        if ( sum > (2.0 * (double)(0x7fffffff) + 1.0) ) \
            (vT).ul[i] = 0xffffffff; \
        else \
            (vT).ul[i] = (ulong)sum; \
    } \
}

#define VUPKHSB_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 7; i >= 0; i-- ) \
        (vT).s[i] = (short)(vB).c[i]; \
}

#define VUPKHSH_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 3; i >= 0; i-- ) \
        (vT).l[i] = (long)(vB).s[i]; \
}

#define VUPKLSB_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(vB).c[i + 8]; \
}

#define VUPKLSH_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(vB).s[i + 4]; \
}

#if defined( LITTLE_ENDIAN )
#define VUPKHSB( vT, vB )
#define VUPKHSH( vT, vB )
#define VUPKLSB( vT, vB )
#define VUPKLSH( vT, vB )
#else
#define VUPKHSB( vT, vB )
#define VUPKHSH( vT, vB )
#define VUPKLSB( vT, vB )
#define VUPKLSH( vT, vB )
#endif

#define VXOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] ^ (vB).ul[i]; \
}

#endif /* end BUILD_MAX */

#define VRSAVE_COND 7 /* recommended VR condition bit */

/*
 * macros to save and restore the CR register
 */
#define SAVE_CR
#define REST_CR

/*
 * macros to save and restore the LR register
 */
#define SAVE_LR
#define REST_LR

/*
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 *
 * For MAX only:
 *
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 */
#define GET_GPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)gpr_save_area + 15) & ~15);

#define GET_FPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)fpr_save_area + 15) & ~15);

#if defined( BUILD_MAX )
#define GET_VR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)vr_save_area + 15) & ~15);
#endif

/* 

* macros to allocate and free space on the user stack. 

* For C implementation, the size is limited to 4096 bytes. 
*/ 

tfdefine PUSH STACK ( nbytes ) \ 

sp = (long) (( (ulong) stack + 15) & -15); 

#define POP_STACK( nbytes ) \ 
sp = 0; 

#define ALLOCATE STACK SPACE ( ptr, nbytes ) \ 
PUSH STACK ( nbytes ) \ 
ptr = sp; 

#define FREE_STACK_SPACE ( nbytes ) P0P_STACK( nbytes ) 

^define CREATE_STACK FRAME ( nbytes ) \ 
PUSH_STACK( nbytes ) 

^define CREATE STACK FRAME X( nbytes ) \ 
CREATE_STACK_FRAME ( nbytes ) 

#define DESTROY_STACK_FRAME \ 
sp = 0; 

#define CREATE STACK BUFFER ( bufferp, byte_align, nbytes ) \ 
ALLOCATE_STACK_SPACE( bufferp, nbytes ) 

#define CREATE STACK BUFFER X( bufferp, byte_align, nbytes ) \ 
CREATE_STACK_BUFFER ( bufferp, byte^align, nbytes ) 

#define DESTROY_STACK_BUFFER \ 
sp = 0; 
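The save-area and stack macros above all round an address up to a 16-byte boundary by adding 15 and masking off the low four bits. The sketch below isolates that idiom. It assumes the mask is the complement of 15 (`~15`), and the helper name `align_up_16` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of 16.
 * Adding 15 carries into bit 4 unless addr is already aligned;
 * the mask then clears the low four bits. */
static uintptr_t align_up_16(uintptr_t addr)
{
    return (addr + 15u) & ~(uintptr_t)15u;
}
```

An already aligned address is returned unchanged, so the rounding never wastes a full 16 bytes.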
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/* 

* macros to create salcache from the stack, used in ucode only 
*/ 

#define CREATE_STACK_SALCACHE \
    char localcachebuffer[SALCACHE_ALLOC_SIZE];

#define DESTROY_STACK_SALCACHE



/*
 * macros for saving and restoring non-volatile
 * floating point registers (FPRs)
 */

#define SAVE_f14
#define SAVE_f14_f15
#define SAVE_f14_f16
#define SAVE_f14_f17
#define SAVE_f14_f18
#define SAVE_f14_f19
#define SAVE_f14_f20
#define SAVE_f14_f21
#define SAVE_f14_f22
#define SAVE_f14_f23
#define SAVE_f14_f24
#define SAVE_f14_f25
#define SAVE_f14_f26
#define SAVE_f14_f27
#define SAVE_f14_f28
#define SAVE_f14_f29
#define SAVE_f14_f30
#define SAVE_f14_f31

#define SAVE_d14
#define SAVE_d14_d15
#define SAVE_d14_d16
#define SAVE_d14_d17
#define SAVE_d14_d18
#define SAVE_d14_d19
#define SAVE_d14_d20
#define SAVE_d14_d21
#define SAVE_d14_d22
#define SAVE_d14_d23
#define SAVE_d14_d24
#define SAVE_d14_d25
#define SAVE_d14_d26
#define SAVE_d14_d27
#define SAVE_d14_d28
#define SAVE_d14_d29
#define SAVE_d14_d30
#define SAVE_d14_d31

#define REST_f14
#define REST_f14_f15
#define REST_f14_f16
#define REST_f14_f17
#define REST_f14_f18
#define REST_f14_f19
#define REST_f14_f20
#define REST_f14_f21
#define REST_f14_f22
#define REST_f14_f23
#define REST_f14_f24
#define REST_f14_f25
#define REST_f14_f26
#define REST_f14_f27
#define REST_f14_f28
#define REST_f14_f29
#define REST_f14_f30

/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */

#define SAVE_r13
#define SAVE_r13_r14
#define SAVE_r13_r15
#define SAVE_r13_r16
#define SAVE_r13_r17
#define SAVE_r13_r18
#define SAVE_r13_r19
#define SAVE_r13_r20
#define SAVE_r13_r21
#define SAVE_r13_r22
#define SAVE_r13_r23
#define SAVE_r13_r24
#define SAVE_r13_r25
#define SAVE_r13_r26
#define SAVE_r13_r27
#define SAVE_r13_r28
#define SAVE_r13_r29
#define SAVE_r13_r30
#define SAVE_r13_r31

#define SAVE_r16
#define SAVE_r16_r17
#define SAVE_r16_r18
#define SAVE_r16_r19
#define SAVE_r16_r20
#define SAVE_r16_r21
#define SAVE_r16_r22
#define SAVE_r16_r23
#define SAVE_r16_r24
#define SAVE_r16_r25
#define SAVE_r16_r26
#define SAVE_r16_r27
#define SAVE_r16_r28
#define SAVE_r16_r29
#define SAVE_r16_r30
#define SAVE_r16_r31

#define REST_r16
#define REST_r16_r17
#define REST_r16_r18
#define REST_r16_r19
#define REST_r16_r20
#define REST_r16_r21
#define REST_r16_r22
#define REST_r16_r23
#define REST_r16_r24
#define REST_r16_r25
#define REST_r16_r26
#define REST_r16_r27
#define REST_r16_r28
#define REST_r16_r29
#define REST_r16_r30
#define REST_r16_r31



/* 

* VMX registers 
*/ 



#define USE_THRU_v0( cond )
#define USE_THRU_v1( cond )
#define USE_THRU_v2( cond )
#define USE_THRU_v3( cond )
#define USE_THRU_v4( cond )
#define USE_THRU_v5( cond )
#define USE_THRU_v6( cond )
#define USE_THRU_v7( cond )
#define USE_THRU_v8( cond )
#define USE_THRU_v9( cond )
#define USE_THRU_v10( cond )
#define USE_THRU_v11( cond )
#define USE_THRU_v12( cond )
#define USE_THRU_v13( cond )
#define USE_THRU_v14( cond )
#define USE_THRU_v15( cond )
#define USE_THRU_v16( cond )
#define USE_THRU_v17( cond )
#define USE_THRU_v18( cond )
#define USE_THRU_v19( cond )
#define USE_THRU_v20( cond )
#define USE_THRU_v21( cond )
#define USE_THRU_v22( cond )
#define USE_THRU_v23( cond )
#define USE_THRU_v24( cond )
#define USE_THRU_v25( cond )
#define USE_THRU_v26( cond )
#define USE_THRU_v27( cond )
#define USE_THRU_v28( cond )
#define USE_THRU_v29( cond )
#define USE_THRU_v30( cond )
#define USE_THRU_v31( cond )

#define FREE_THRU_v0( cond )
#define FREE_THRU_v1( cond )
#define FREE_THRU_v2( cond )
#define FREE_THRU_v3( cond )
#define FREE_THRU_v4( cond )
#define FREE_THRU_v5( cond )
#define FREE_THRU_v6( cond )
#define FREE_THRU_v7( cond )
#define FREE_THRU_v8( cond )
#define FREE_THRU_v9( cond )
#define FREE_THRU_v10( cond )
#define FREE_THRU_v11( cond )
#define FREE_THRU_v12( cond )
#define FREE_THRU_v13( cond )
#define FREE_THRU_v14( cond )
#define FREE_THRU_v15( cond )
#define FREE_THRU_v16( cond )
#define FREE_THRU_v17( cond )
#define FREE_THRU_v18( cond )
#define FREE_THRU_v19( cond )
#define FREE_THRU_v20( cond )
#define FREE_THRU_v21( cond )
#define FREE_THRU_v22( cond )
#define FREE_THRU_v23( cond )
#define FREE_THRU_v24( cond )
#define FREE_THRU_v25( cond )
#define FREE_THRU_v26( cond )
#define FREE_THRU_v27( cond )
#define FREE_THRU_v28( cond )
#define FREE_THRU_v29( cond )
#define FREE_THRU_v30( cond )
#define FREE_THRU_v31( cond )

#endif










/* 
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END OF FILE salppc.h 




salppc. inc 



3/9/2001 



#if !defined( SALPPC_INC )
#define SALPPC_INC

#if 0
 ***************************************************************************
 *           MC Standard Algorithms -- PPC Version
 ***************************************************************************
 *
 *  File Name:    salppc.inc
 *
 *  Description:  SAL macro include file
 *
 *  Source files should have extension .mac (for example, vadd.mac)
 *  and must include this file (salppc.inc).
 *
 *  To assemble for PPC ucode, use the following basic
 *  makefile build rule:
 *
 *      .SUFFIXES: .mac .c .s .o
 *
 *      .mac.o:
 *          cp $*.mac $*.c
 *          ccmc -o $*.s -E $*.c
 *          ccmc -c -o $*.o $*.s
 *          rm -f $*.s
 *          rm -f $*.c
 *
 *  To compile for C, use the following basic makefile build rule:
 *
 *      .SUFFIXES: .mac .c .o
 *
 *      .mac.o:
 *          cp $*.mac $*.c
 *          ccmc -DCOMPILE_C -o $*.o $*.c
 *          rm -f $*.c
 *
 *  The first 8 function arguments are passed in GPR registers
 *  r3 - r10. Arguments beyond 8 are passed on the stack and may
 *  be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
 *  Additional GPR registers should be assigned in ascending order
 *  starting from the last function argument. These may be declared
 *  with the DECLARE_rx[_ry] macros. For example, a function with
 *  5 arguments that requires 3 additional GPR registers would
 *  issue: DECLARE_r8_r10. r0, if required, should be declared
 *  separately with the DECLARE_r0 macro. GPR registers above r12
 *  must be saved and restored using the SAVE_r13[_ry] and
 *  REST_r13[_ry] macros, respectively.
 *
 *  FPR registers should be assigned in ascending order starting
 *  with f0[d0]. These may be declared with the DECLARE_f0[_fy]
 *  or DECLARE_d0[_dy] macros.
 *
 *  For example, DECLARE_f0_f11. FPR registers above f13[d13] must
 *  be saved and restored using the SAVE_f14[_fy] and REST_f14[_fy]
 *  or SAVE_d14[_dy] and REST_d14[_dy] macros, respectively.
 *
 *  All variables must be assigned a register using the
 *  pre-processor #define directive. GPR registers are named
 *  r0 - r31; single precision FPR registers are named f0 - f31.
 *  Double precision FPR registers are named d0 - d31. Different
 *  variables may be assigned to the same register as in:
 *
 *      #define vara f12
 *      #define varb f12
 *
 *  Functions must begin with the FUNC_PROLOG macro and end
 *  with the FUNC_EPILOG macro.
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* 
 *
 *  Macros are provided for both Fortran and C entry points.
 *
 *  The GET_SALCACHE macro should be used to get the address of
 *  the "current" salcache buffer into a GPR register.
 *
 *  Avoid terminating macro lines with a semicolon.
 *
 *  The following example demonstrates typical usage:
 *
 *  #include "salppc.inc"
 *
 *  /*
 *   * assign variables to registers
 *   */
 *  #define A      r3
 *  #define I      r4
 *  #define B      r5
 *  #define J      r6
 *  #define C      r7
 *  #define K      r8
 *  #define D      r9
 *  #define L      r10
 *  #define N      r12
 *  #define EFLAG  r11
 *  #define count  r11
 *  #define t0     r13
 *  #define t1     r13
 *  #define t2     r14
 *  #define t3     r14
 *  #define t4     r15
 *  #define t5     r15
 *  #define t6     r16
 *  #define a0     f0
 *  #define a1     f1
 *  #define a2     f2
 *  #define a3     f3
 *  #define b0     f4
 *  #define b1     f5
 *  #define b2     f6
 *  #define b3     f7
 *  #define c0     f8
 *  #define c1     f9
 *  #define c2     f10
 *  #define c3     f11
 *  #define d0     f12
 *  #define d1     f13
 *  #define d2     f14
 *  #define d3     f15
 *
 *  FUNC_PROLOG                  /* must precede function */
 *
 *  #if !defined( COMPILE_C )
 *  U_ENTRY(foo_)
 *      FORTRAN_DREF_4(I, J, K, L)
 *      FORTRAN_DREF_ARG8
 *  U_ENTRY(foo)
 *      LI(EFLAG, 0)
 *      BR(common)
 *
 *  U_ENTRY(foox_)
 *      FORTRAN_DREF_4(I, J, K, L)
 *      FORTRAN_DREF_ARG8
 *      FORTRAN_DREF_ARG9
 *  #endif
 *
 *  ENTRY_10(foox, A, I, B, J, C, K, D, L, N, EFLAG)
 *      DECLARE_r13_r16
 *      DECLARE_f0_f15
 *      GET_ARG9( EFLAG )        /* get the 9th arg (EFLAG) off stack */
 *
 *  LABEL(common)
 *      SAVE_CR                  /* needed if using fields 2,3 or 4 */
 *      SAVE_r13_r16
 *      SAVE_f14_f15
 *      SAVE_LR                  /* needed if making a function call */
 *      GET_ARG8( N )            /* get the 8th arg (N) off stack */
 *
 *      /*
 *       * body of function
 *       */
 *
 *      REST_CR
 *      REST_r13_r16
 *      REST_f14_f15
 *      REST_LR
 *      RETURN
 *
 *  FUNC_EPILOG                  /* must conclude function */



 *
 *  Mercury Computer Systems, Inc.
 *  Copyright (c) 1996 All rights reserved
 *
 *  Revision  Date    Engineer; Reason
 *  --------  ------  ----------------
 *  0.0       960223  jg;  Created
 *  0.1       970109  jfk; Added POSTING_BUFFER_COUNT and made
 *                         TEST_IF_DCBZ macro time "stw" instead
 *                         of doing the TEST_IF_DCBT macro (lwz)
 *  0.2       970124  jfk; Added SALCACHE_ALLOC_SIZE,
 *                         ALIGN_SALCACHE, CREATE_SALCACHE_FRAME,
 *                         DESTROY_SALCACHE_FRAME
 *  0.3       970521  jfk; Added SET_DCB[TZ]_COND macros.
 *                         Made old macros not assemble
 *  0.4       980813  jfk; Changes SALCACHE_ALLOC_SIZE for 750
 ***************************************************************************

#endif /* header */

#if !defined( BUILD_603 ) && !defined( BUILD_750 ) && !defined( BUILD_MAX )
#error You must define BUILD_603 or BUILD_750 or BUILD_MAX
#endif



/*
 * define single precision floating point field sizes,
 * limits, and values
 */

#define F_FLOAT_SIZE   32
#define F_FRAC_SIZE    23
#define F_HIDDEN_SIZE  1
#define F_EXP_SIZE     8
#define F_SIGN_SIZE    1

#define F_SIGN_BIT     (F_FLOAT_SIZE - F_SIGN_SIZE)
#define F_EXP_MASK     ((1 << F_EXP_SIZE) - 1)
#define F_EXP_BIAS     ((1 << (F_EXP_SIZE-1)) - 1)
#define F_MAX_EXP      F_EXP_BIAS
#define F_MIN_EXP      (-(F_EXP_BIAS-1))

/*
 * define double precision floating point field sizes,
 * limits, and values
 */

#define D_FLOAT_SIZE   64
#define D_FRAC_SIZE    52
#define D_HIDDEN_SIZE  1
#define D_EXP_SIZE     11
#define D_SIGN_SIZE    1
#define D_SIGN_BIT     (D_FLOAT_SIZE - D_SIGN_SIZE)
#define D_EXP_MASK     ((1 << D_EXP_SIZE) - 1)
#define D_EXP_BIAS     ((1 << (D_EXP_SIZE-1)) - 1)
#define D_MAX_EXP      D_EXP_BIAS
#define D_MIN_EXP      (-(D_EXP_BIAS-1))
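The field sizes above describe the usual sign/exponent/fraction layout of a floating point word. As a hedged illustration of how the single precision constants fit together, the sketch below extracts the sign and the unbiased exponent of a `float`. It assumes an IEEE 754 binary32 `float`, and the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Constants restated from the listing so the sketch is self-contained. */
#define F_FLOAT_SIZE 32
#define F_FRAC_SIZE  23
#define F_EXP_SIZE   8
#define F_SIGN_SIZE  1
#define F_SIGN_BIT   (F_FLOAT_SIZE - F_SIGN_SIZE)
#define F_EXP_MASK   ((1 << F_EXP_SIZE) - 1)
#define F_EXP_BIAS   ((1 << (F_EXP_SIZE - 1)) - 1)

/* Hypothetical helper: reinterpret a float as its 32-bit pattern. */
static uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined type punning */
    return bits;
}

/* Hypothetical helper: 1 for negative values, 0 otherwise. */
static int float_sign(float x)
{
    return (int)(float_bits(x) >> F_SIGN_BIT);
}

/* Hypothetical helper: exponent field with the bias removed. */
static int float_unbiased_exponent(float x)
{
    return (int)((float_bits(x) >> F_FRAC_SIZE) & F_EXP_MASK) - F_EXP_BIAS;
}
```

The double precision constants would be used the same way on a 64-bit pattern, with `D_FRAC_SIZE` as the shift and `D_EXP_MASK` as the mask.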



#if defined( BUILD_603 )
#define LOG2_CACHE_SIZE (14) /* Log (base 2) of 603 data cache */
#elif defined( BUILD_750 ) || defined( BUILD_MAX )
#define LOG2_CACHE_SIZE (15) /* Log (base 2) of 750 or MAX data cache */
#endif

#define LOG2_CACHE_BSIZE      (LOG2_CACHE_SIZE)
#define LOG2_CACHE_HSIZE      (LOG2_CACHE_SIZE - 1)
#define LOG2_CACHE_LSIZE      (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_FSIZE      (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_DSIZE      (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_CSIZE      (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_ZSIZE      (LOG2_CACHE_SIZE - 4)

#define CACHE_SIZE            (1 << LOG2_CACHE_SIZE)
#define CACHE_BSIZE           (CACHE_SIZE)
#define CACHE_HSIZE           (CACHE_SIZE >> 1)
#define CACHE_LSIZE           (CACHE_SIZE >> 2)
#define CACHE_FSIZE           (CACHE_SIZE >> 2)
#define CACHE_DSIZE           (CACHE_SIZE >> 3)
#define CACHE_CSIZE           (CACHE_SIZE >> 3)
#define CACHE_ZSIZE           (CACHE_SIZE >> 4)

#define LOG2_CACHE_LINE_SIZE  5
#define CACHE_LINE_SIZE       (1 << LOG2_CACHE_LINE_SIZE)
#define CACHE_LINE_LSIZE      (CACHE_LINE_SIZE >> 2)
#define CACHE_LINE_MASK       (CACHE_LINE_SIZE - 1)
#define CACHE_LINE_ADDR_MASK  (0xffffffe0)

#define LOG2_SALCACHE_ALIGN   6
#define SALCACHE_ALIGN        (1 << LOG2_SALCACHE_ALIGN)
#define SALCACHE_ALIGN_MASK   (SALCACHE_ALIGN - 1)

#define SALCACHE_SIZE         CACHE_SIZE
#define SALCACHE_EXTRA_SIZE   (SALCACHE_ALIGN + 64)
#define SALCACHE_ALLOC_SIZE   (SALCACHE_SIZE + SALCACHE_EXTRA_SIZE)
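The salcache allocation size above is the data cache size plus slack for alignment. The following sketch works through that arithmetic for the two cache sizes named in the listing and shows one way the slack could be used to align a raw buffer. The helper names are hypothetical, and the alignment helper is only an illustration, not the listing's own ALIGN_SALCACHE macro, whose definition is not shown here.

```c
#include <stdint.h>

enum {
    LOG2_SALCACHE_ALIGN = 6,
    SALCACHE_ALIGN      = 1 << LOG2_SALCACHE_ALIGN,   /* 64 bytes */
    SALCACHE_ALIGN_MASK = SALCACHE_ALIGN - 1,
    SALCACHE_EXTRA_SIZE = SALCACHE_ALIGN + 64
};

/* Hypothetical helper: bytes to allocate for a data cache of
 * 2^log2_cache_size bytes (14 for the 603, 15 for the 750 or MAX). */
static uint32_t salcache_alloc_size(unsigned log2_cache_size)
{
    uint32_t cache_size = (uint32_t)1 << log2_cache_size;
    return cache_size + SALCACHE_EXTRA_SIZE;
}

/* Hypothetical helper: round a raw buffer address up to SALCACHE_ALIGN. */
static uintptr_t salcache_align(uintptr_t addr)
{
    return (addr + SALCACHE_ALIGN_MASK) & ~(uintptr_t)SALCACHE_ALIGN_MASK;
}
```

Under these definitions a 603 build would allocate 16384 + 128 bytes and a 750 or MAX build 32768 + 128 bytes.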



/*
 * Define memory vector non-cache (N) / cache (C) FLAG values for
 * Enhanced SAL calls (final argument). The letters in the symbol
 * correspond to the vectors in the call, moving from left to right
 * so, for example:
 *
 * for VMULX, there are the following 8 possibilities:
 *
 * VMULX (A, I, B, J, C, K, N, SAL_NNN)   A, B, C all not in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NNC)   A, B not in cache, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCN)   A, C not in cache, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCC)   A not in cache, B, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNN)   B, C not in cache, A in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNC)   B not in cache, A, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCN)   C not in cache, A, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCC)   A, B, C all in cache
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*/ 
/* 

* 1 vector algorithms 
*/ 

#define SAL_N 0
#define SAL_C 1

/*
 * 2 vector algorithms
 */

#define SAL_NN 0
#define SAL_NC 1
#define SAL_CN 2
#define SAL_CC 3

/*
 * 3 vector algorithms
 */

#define SAL_NNN 0
#define SAL_NNC 1
#define SAL_NCN 2
#define SAL_NCC 3
#define SAL_CNN 4
#define SAL_CNC 5
#define SAL_CCN 6
#define SAL_CCC 7

/*
 * 4 vector algorithms
 */

#define SAL_NNNN 0
#define SAL_NNNC 1
#define SAL_NNCN 2
#define SAL_NNCC 3
#define SAL_NCNN 4
#define SAL_NCNC 5
#define SAL_NCCN 6
#define SAL_NCCC 7
#define SAL_CNNN 8
#define SAL_CNNC 9
#define SAL_CNCN 10
#define SAL_CNCC 11
#define SAL_CCNN 12
#define SAL_CCNC 13
#define SAL_CCCN 14
#define SAL_CCCC 15

/*
 * 5 vector algorithms
 */

#define SAL_NNNNN 0
#define SAL_NNNNC 1
#define SAL_NNNCN 2
#define SAL_NNNCC 3
#define SAL_NNCNN 4
#define SAL_NNCNC 5
#define SAL_NNCCN 6
#define SAL_NNCCC 7
#define SAL_NCNNN 8
#define SAL_NCNNC 9
#define SAL_NCNCN 10
#define SAL_NCNCC 11
#define SAL_NCCNN 12
#define SAL_NCCNC 13
#define SAL_NCCCN 14
#define SAL_NCCCC 15
#define SAL_CNNNN 16
#define SAL_CNNNC 17
#define SAL_CNNCN 18
#define SAL_CNNCC 19
#define SAL_CNCNN 20
#define SAL_CNCNC 21
#define SAL_CNCCN 22
#define SAL_CNCCC 23
#define SAL_CCNNN 24
#define SAL_CCNNC 25
#define SAL_CCNCN 26
#define SAL_CCNCC 27
#define SAL_CCCNN 28
#define SAL_CCCNC 29
#define SAL_CCCCN 30
#define SAL_CCCCC 31
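The flag values above follow a binary pattern: each vector contributes one bit (N = 0, C = 1), and the leftmost vector in the call occupies the most significant bit. A short sketch of that encoding is given below. The function `sal_flag_from_letters` is hypothetical and is not part of the listing.

```c
#include <stddef.h>

/* Hypothetical helper: build a flag value from a string of 'N'/'C'
 * letters, one per vector, leftmost vector first.
 * Returns -1 if any other letter is found. */
static int sal_flag_from_letters(const char *letters)
{
    int flag = 0;
    for (size_t i = 0; letters[i] != '\0'; i++) {
        if (letters[i] != 'N' && letters[i] != 'C')
            return -1;
        flag = (flag << 1) | (letters[i] == 'C');   /* shift in one bit */
    }
    return flag;
}
```

For example, the letters "NCC" give 3 and "CNN" give 4, matching SAL_NCC and SAL_CNN in the listing.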
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define byte offsets into FFT_setup_ppc603e 



#define FFT_SETUP_HANDLE            0
#define FFT_SETUP_SMALL_TWIDP       4
#define FFT_SETUP_SMALL_BITR_TWIDP  8
#define FFT_SETUP_SMALL_LOG2M       12
#define FFT_SETUP_BIG_TWIDP         16
#define FFT_SETUP_BIG_XY_TWIDP      20
#define FFT_SETUP_BIG_LOG2MXY       24
#define FFT_SETUP_BIG_LOG2X         28
#define FFT_SETUP_BIG_LOG2Y         32
#define FFT_SETUP_BIG_STRIPX        36
#define FFT_SETUP_RPASS_TWIDP       40
#define FFT_SETUP_RADIX3_TWIDP      44
#define FFT_SETUP_RADIX5_TWIDP      48
#define FFT_SETUP_LOG2M             52
#define FFT_SETUP_LOG2MR            56
#define FFT_SETUP_VMX_BITR_TWIDP    60
#define FFT_SETUP_VMX_TABLES        64


/*
 * ASIC equates
 */

#define ASIC_H                -1024          /* (0xFBFF + 1) */
#define PREFETCH_CONTROL      (0xFBFFFE00)
#define PREFETCH_CONTROL_H    -1024          /* (0xFBFF + 1) */
#define PREFETCH_CONTROL_L    -512           /* (0xFE00) */
#define MISCON_B              (0xFBFFFC18)
#define MISCON_B_H            -1024          /* (0xFBFF + 1) */
#define MISCON_B_L            -1000          /* (0xFC18) */

#define PREFETCH_DISABLED     0
#define PREFETCH_AUTO_6       1
#define PREFETCH_AUTO_5       2
#define PREFETCH_AUTO_4       3
#define PREFETCH_AUTO_3       4
#define PREFETCH_AUTO_2       5
#define PREFETCH_AUTO_1       6
#define PREFETCH_AUTO_0       7

#define PREFETCH_MANUAL_0     8
#define PREFETCH_MANUAL_2     9
#define PREFETCH_MANUAL_4     10
#define PREFETCH_MANUAL_6     11
#define PREFETCH_MANUAL_8     12
#define PREFETCH_MANUAL_10    13
#define PREFETCH_MANUAL_12    14
#define PREFETCH_MANUAL_14    15

#define USE_PREFETCH_CONTROL  16
#define USE_MISCON_B          0
#define PREFETCH_MASK         15

#define PREFETCH_DEFAULT      (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_OFF          (USE_PREFETCH_CONTROL | PREFETCH_DISABLED)
#define PREFETCH_A6           (USE_PREFETCH_CONTROL | PREFETCH_AUTO_6)
#define PREFETCH_A5           (USE_PREFETCH_CONTROL | PREFETCH_AUTO_5)
#define PREFETCH_A4           (USE_PREFETCH_CONTROL | PREFETCH_AUTO_4)
#define PREFETCH_A3           (USE_PREFETCH_CONTROL | PREFETCH_AUTO_3)
#define PREFETCH_A2           (USE_PREFETCH_CONTROL | PREFETCH_AUTO_2)
#define PREFETCH_A1           (USE_PREFETCH_CONTROL | PREFETCH_AUTO_1)
#define PREFETCH_A0           (USE_PREFETCH_CONTROL | PREFETCH_AUTO_0)
#define PREFETCH_M0           (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_M2           (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_2)
#define PREFETCH_M4           (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_4)
#define PREFETCH_M6           (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_6)
#define PREFETCH_M8           (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_8)
#define PREFETCH_M10          (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_10)
#define PREFETCH_M12          (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_12)
#define PREFETCH_M14          (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_14)
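Each composite prefetch value above pairs a register selector (USE_PREFETCH_CONTROL or USE_MISCON_B) with a 4-bit mode. The sketch below shows how such a value could be split back into its parts with PREFETCH_MASK. It assumes the two parts are combined with a bitwise OR, which gives the same result as addition for these values; how the library actually decodes the flag is not shown in the listing, and the helper names are hypothetical.

```c
enum {
    PREFETCH_DISABLED    = 0,
    PREFETCH_MANUAL_0    = 8,
    PREFETCH_MASK        = 15,
    USE_PREFETCH_CONTROL = 16
};

/* Hypothetical helper: the 4-bit prefetch mode held in the low bits. */
static int prefetch_mode(int flag)
{
    return flag & PREFETCH_MASK;
}

/* Hypothetical helper: nonzero if the PREFETCH_CONTROL register is selected. */
static int uses_prefetch_control(int flag)
{
    return (flag & USE_PREFETCH_CONTROL) != 0;
}
```

Under this assumption, a value built as USE_PREFETCH_CONTROL combined with PREFETCH_MANUAL_0 equals 24 and decodes to mode 8 with the control register selected.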
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/* 

 * macro to compile for PPC assembly (COMPILE_C *not* defined) or
 * C code (COMPILE_C defined)
 */

#if defined( COMPILE_C )

#include "salppc.h"

#else

/*
 * GPR register equates
 */

#define r0    0
#define sp    1
#define rtoc  2
#define r3    3
#define r4    4
#define r5    5
#define r6    6
#define r7    7
#define r8    8
#define r9    9
#define r10   10
#define r11   11
#define r12   12
#define r13   13
#define r14   14
#define r15   15
#define r16   16
#define r17   17
#define r18   18
#define r19   19
#define r20   20
#define r21   21
#define r22   22
#define r23   23
#define r24   24
#define r25   25
#define r26   26
#define r27   27
#define r28   28
#define r29   29
#define r30   30
#define r31   31

/*
 * FPR single precision register equates
 */

#define f0    0
#define f1    1
#define f2    2
#define f3    3
#define f4    4
#define f5    5
#define f6    6
#define f7    7
#define f8    8
#define f9    9
#define f10   10
#define f11   11
#define f12   12
#define f13   13
#define f14   14
#define f15   15
#define f16   16
#define f17   17
#define f18   18
#define f19   19
#define f20   20
#define f21   21
#define f22   22
#define f23   23
#define f24   24
#define f25   25
#define f26   26
#define f27   27
#define f28   28
#define f29   29
#define f30   30
#define f31   31

/*
 * FPR double precision register equates
 */

#define d0    0
#define d1    1
#define d2    2
#define d3    3
#define d4    4
#define d5    5
#define d6    6
#define d7    7
#define d8    8
#define d9    9
#define d10   10
#define d11   11
#define d12   12
#define d13   13
#define d14   14
#define d15   15
#define d16   16
#define d17   17
#define d18   18
#define d19   19
#define d20   20
#define d21   21
#define d22   22
#define d23   23
#define d24   24
#define d25   25
#define d26   26
#define d27   27
#define d28   28
#define d29   29
#define d30   30
#define d31   31

#if defined( BUILD_MAX )

/*
 * VMX (g4) register equates
 */

#define v0    0
#define v1    1
#define v2    2
#define v3    3
#define v4    4
#define v5    5
#define v6    6
#define v7    7
#define v8    8
#define v9    9
#define v10   10
#define v11   11
#define v12   12
#define v13   13
#define v14   14
#define v15   15
#define v16   16
#define v17   17
#define v18   18
#define v19   19
#define v20   20
#define v21   21
#define v22   22
#define v23   23
#define v24   24
#define v25   25
#define v26   26
#define v27   27
#define v28   28
#define v29   29
#define v30   30
#define v31   31

#endif



#define FUNC_PROLOG \
    .section .text; \
    .align 5;

#define FUNC_EPILOG

#define TEXT_SECTION( logb2_align ) \
    .section .text; \
    .align logb2_align;

#define DATA_SECTION( logb2_align ) \
    .section .data; \
    .align logb2_align;

#define RODATA_SECTION( logb2_align ) \
    .section .rodata; \
    .align logb2_align;

#define PC_OFFSET( nbytes ) (. + (nbytes))

/*
 * make a "double" concat to fool the preprocessor so that input
 * arguments get translated before concatenation; otherwise, the
 * concatenated symbol doesn't get translated properly
 */

#define CONCAT( left, right ) CONCAT_NEST( left, right )
#define CONCAT_NEST( left, right ) left##right
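The two-level CONCAT above relies on a standard preprocessor rule: an argument that is an operand of `##` is pasted as written, while an argument passed through an outer macro is fully expanded first. The sketch below makes the difference observable. The alias `REG` and the two variables are hypothetical names chosen only for this illustration.

```c
#define CONCAT_NEST( left, right ) left##right
#define CONCAT( left, right ) CONCAT_NEST( left, right )

#define REG r3                        /* hypothetical register alias */

static const int r3_value  = 3;       /* named when REG is expanded first */
static const int REG_value = -1;      /* named when REG is pasted unexpanded */

/* CONCAT expands REG to r3 before pasting, giving r3_value. */
static int via_concat(void)
{
    return CONCAT( REG, _value );
}

/* CONCAT_NEST pastes the literal token REG, giving REG_value. */
static int via_concat_nest(void)
{
    return CONCAT_NEST( REG, _value );
}
```

This is why macros that build symbol names from register or variable aliases should go through CONCAT rather than using `##` directly.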

/* 

* macro for extern declarations and definitions 
*/ 

#define EXTERN_DATA( symbol )
#define EXTERN_FUNC( func )

/*
 * macro for a global declaration
 */

#define GLOBAL( symbol ) \
    .globl symbol

/*
 * macro for a local declaration
 */

#define LOCAL( symbol )

/*
 * macros for creating static arrays
 */

#define START_ARRAY( name ) \
    name##:

#define START_C_ARRAY( name )   START_ARRAY( name )
#define START_UC_ARRAY( name )  START_ARRAY( name )
#define START_S_ARRAY( name )   START_ARRAY( name )
#define START_US_ARRAY( name )  START_ARRAY( name )
#define START_L_ARRAY( name )   START_ARRAY( name )
#define START_UL_ARRAY( name )  START_ARRAY( name )
#define START_F_ARRAY( name )   START_ARRAY( name )

#define END_ARRAY

#define DATA( type, d1 ) \
    .##type d1

#define DATA2( type, d1, d2 ) \
    .##type d1, d2

#define DATA4( type, d1, d2, d3, d4 ) \
    .##type d1, d2, d3, d4

#define DATA8( type, d1, d2, d3, d4, d5, d6, d7, d8 ) \
    .##type d1, d2, d3, d4, d5, d6, d7, d8

#define C_DATA( d1 )   DATA( byte, d1 )
#define UC_DATA( d1 )  DATA( byte, d1 )
#define S_DATA( d1 )   DATA( short, d1 )
#define US_DATA( d1 )  DATA( short, d1 )
#define L_DATA( d1 )   DATA( long, d1 )
#define UL_DATA( d1 )  DATA( long, d1 )
#define F_DATA( d1 )   DATA( float, d1 )

#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 ) DATA2( long, d2, d1 )
#else
#define D_DATA( d1, d2 ) DATA2( long, d1, d2 )
#endif

#define C_DATA2( d1, d2 )   DATA2( byte, d1, d2 )
#define UC_DATA2( d1, d2 )  DATA2( byte, d1, d2 )
#define S_DATA2( d1, d2 )   DATA2( short, d1, d2 )
#define US_DATA2( d1, d2 )  DATA2( short, d1, d2 )
#define L_DATA2( d1, d2 )   DATA2( long, d1, d2 )
#define UL_DATA2( d1, d2 )  DATA2( long, d1, d2 )
#define F_DATA2( d1, d2 )   DATA2( float, d1, d2 )

#define C_DATA4( d1, d2, d3, d4 )   DATA4( byte, d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 )  DATA4( byte, d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 )   DATA4( short, d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 )  DATA4( short, d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 )   DATA4( long, d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 )  DATA4( long, d1, d2, d3, d4 )
#define F_DATA4( d1, d2, d3, d4 )   DATA4( float, d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( float, d1, d2, d3, d4, d5, d6, d7, d8 )



/*
 * macros for creating vmx permute masks (128-bits)
 */

#if defined( LITTLE_ENDIAN )

#define L_PERMUTE_MUNGE( l )  ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s )  ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c )  ( (c) ^ 0x1f )
#define L_INDEX_MUNGE( x )    ( (x) ^ 0x3 )
#define S_INDEX_MUNGE( x )    ( (x) ^ 0x7 )
#define C_INDEX_MUNGE( x )    ( (x) ^ 0xf )

#else

#define L_PERMUTE_MUNGE( l )  ( l )
#define S_PERMUTE_MUNGE( s )  ( s )
#define C_PERMUTE_MUNGE( c )  ( c )

#define L_INDEX_MUNGE( x )    ( x )
#define S_INDEX_MUNGE( x )    ( x )
#define C_INDEX_MUNGE( x )    ( x )

#endif

#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
    .long L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
          L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 )

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
    .short S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
           S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
           S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
           S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 )

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
    .byte C_PERMUTE_MUNGE( c1 ), C_PERMUTE_MUNGE( c2 ), \
          C_PERMUTE_MUNGE( c3 ), C_PERMUTE_MUNGE( c4 ), \
          C_PERMUTE_MUNGE( c5 ), C_PERMUTE_MUNGE( c6 ), \
          C_PERMUTE_MUNGE( c7 ), C_PERMUTE_MUNGE( c8 ), \
          C_PERMUTE_MUNGE( c9 ), C_PERMUTE_MUNGE( c10 ), \
          C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
          C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
          C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 )
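On little-endian builds the munge macros above adjust element indices and permute selectors so that code written with big-endian element numbering still addresses the intended elements. The sketch below illustrates the index case, assuming the combining operator is a bitwise XOR: XOR with n - 1 maps index i to n - 1 - i when n is a power of two, which reverses the element order within the 128-bit vector. The function names are hypothetical.

```c
/* Hypothetical helpers mirroring the little-endian index munges:
 * 4 long, 8 short, or 16 byte elements per 128-bit vector. */
static unsigned l_index_munge(unsigned x) { return x ^ 0x3u; }
static unsigned s_index_munge(unsigned x) { return x ^ 0x7u; }
static unsigned c_index_munge(unsigned x) { return x ^ 0xfu; }
```

Applying a munge twice returns the original index, so the same helper converts in both directions.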



/*
 * macro for a microcode entry point (e.g. vaddx, vaddx)
 * U_ENTRY is a "nop" for C code
 */

#define U_ENTRY( func_name ) \
    .globl func_name; \
    func_name:

/*
 * macros for C function prototypes
 */

#define PROTOTYPE_0( func_name )
#define PROTOTYPE_1( func_name )
#define PROTOTYPE_2( func_name )
#define PROTOTYPE_3( func_name )
#define PROTOTYPE_4( func_name )
#define PROTOTYPE_5( func_name )
#define PROTOTYPE_6( func_name )
#define PROTOTYPE_7( func_name )
#define PROTOTYPE_8( func_name )
#define PROTOTYPE_9( func_name )
#define PROTOTYPE_10( func_name )
#define PROTOTYPE_11( func_name )
#define PROTOTYPE_12( func_name )
#define PROTOTYPE_13( func_name )
#define PROTOTYPE_14( func_name )
#define PROTOTYPE_15( func_name )
#define PROTOTYPE_16( func_name )

/* 

* macros for C and Fortran callable entry points 
*/ 

#define ENTRY_0( func_name ) \
    .globl func_name; \
func_name:

#define ENTRY_1( func_name, arg0 ) \
    .globl func_name; \
func_name:

#define ENTRY_2( func_name, arg0, arg1 ) \
    .globl func_name; \
func_name:

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
    .globl func_name; \
func_name:

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
    .globl func_name; \
func_name:

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    .globl func_name; \
func_name:

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    .globl func_name; \
func_name:

#define ENTRY_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6 ) \
    .globl func_name; \
func_name:

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7 ) \
    .globl func_name; \
func_name:

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7, arg8 ) \
    .globl func_name; \
func_name:

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9 ) \
    .globl func_name; \
func_name:

#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10 ) \
    .globl func_name; \
func_name:

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11 ) \
    .globl func_name; \
func_name:

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12 ) \
    .globl func_name; \
func_name:

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13 ) \
    .globl func_name; \
func_name:

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14 ) \
    .globl func_name; \
func_name:

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14, arg15 ) \
    .globl func_name; \
func_name:

/* 

 * macros to de-reference any set of the first 8 arguments
 * passed by reference to the Fortran entry point but by
 * value to the corresponding C entry point
 */

#define FORTRAN_DREF_1( arg0 ) \
    lwz arg0, 0(arg0);

#define FORTRAN_DREF_2( arg0, arg1 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1);

#define FORTRAN_DREF_3( arg0, arg1, arg2 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2);

#define FORTRAN_DREF_4( arg0, arg1, arg2, arg3 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3);

#define FORTRAN_DREF_5( arg0, arg1, arg2, arg3, arg4 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4);

#define FORTRAN_DREF_6( arg0, arg1, arg2, arg3, arg4, arg5 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5);

#define FORTRAN_DREF_7( arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6);

#define FORTRAN_DREF_8( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6); \
    lwz arg7, 0(arg7);
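The de-referencing macros above replace each argument register, which holds an address on the Fortran side, with the word stored at that address, so that the by-value C entry point can then be used unchanged. A minimal C sketch of the same idea follows; the routine `scale_add` and its Fortran-style wrapper are hypothetical examples, not functions from the listing.

```c
/* Hypothetical by-value C entry point. */
int scale_add(int a, int b, int k)
{
    return a + k * b;
}

/*
 * Hypothetical Fortran-style entry point: every argument arrives as a
 * pointer. Loading through each pointer models one "lwz argN, 0(argN)"
 * of the FORTRAN_DREF_3 macro, after which the C entry point is reused.
 */
int scale_add_fortran(const int *a, const int *b, const int *k)
{
    return scale_add(*a, *b, *k);
}
```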

/* 

* macros to de-reference specific arguments beyond the first 8 

* passed by value to the C entry point 
*/ 

#define ARG_OFF (8 - 8*4)

#define FORTRAN_DREF_ARG8 \
    lwz r12, (ARG_OFF + 8*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 8*4)(sp);

#define FORTRAN_DREF_ARG9 \
    lwz r12, (ARG_OFF + 9*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 9*4)(sp);

#define FORTRAN_DREF_ARG10 \
    lwz r12, (ARG_OFF + 10*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 10*4)(sp);

#define FORTRAN_DREF_ARG11 \
    lwz r12, (ARG_OFF + 11*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 11*4)(sp);

#define FORTRAN_DREF_ARG12 \
    lwz r12, (ARG_OFF + 12*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 12*4)(sp);

#define FORTRAN_DREF_ARG13 \
    lwz r12, (ARG_OFF + 13*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 13*4)(sp);

#define FORTRAN_DREF_ARG14 \
    lwz r12, (ARG_OFF + 14*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 14*4)(sp);

#define FORTRAN_DREF_ARG15 \
    lwz r12, (ARG_OFF + 15*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 15*4)(sp);

#define FORTRAN_DREF_ARG16 \
    lwz r12, (ARG_OFF + 16*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 16*4)(sp);

#define FORTRAN_DREF_ARG17 \
    lwz r12, (ARG_OFF + 17*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 17*4)(sp);
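The stack offsets used by these macros follow directly from ARG_OFF: argument n (counting from 0) is addressed at ARG_OFF + n*4 from the stack pointer, so the first argument that does not fit in a register, argument 8, lands at offset 8. The sketch below only reproduces that arithmetic in C; the names `ARG_OFF_MODEL` and `arg_stack_offset` are hypothetical.

```c
/* Hypothetical C model of the ARG_OFF arithmetic (4-byte argument words). */
enum { ARG_OFF_MODEL = 8 - 8 * 4 };

/* Byte offset from the stack pointer of argument n, for n >= 8. */
int arg_stack_offset(int n)
{
    return ARG_OFF_MODEL + n * 4;
}
```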



/*
 * macros to get GPR arguments beyond 8
 */

#define GET_ARG8( rD )   lwz rD, (ARG_OFF + 8*4)(sp);
#define GET_ARG9( rD )   lwz rD, (ARG_OFF + 9*4)(sp);
#define GET_ARG10( rD )  lwz rD, (ARG_OFF + 10*4)(sp);
#define GET_ARG11( rD )  lwz rD, (ARG_OFF + 11*4)(sp);
#define GET_ARG12( rD )  lwz rD, (ARG_OFF + 12*4)(sp);
#define GET_ARG13( rD )  lwz rD, (ARG_OFF + 13*4)(sp);
#define GET_ARG14( rD )  lwz rD, (ARG_OFF + 14*4)(sp);
#define GET_ARG15( rD )  lwz rD, (ARG_OFF + 15*4)(sp);
#define GET_ARG16( rD )  lwz rD, (ARG_OFF + 16*4)(sp);
#define GET_ARG17( rD )  lwz rD, (ARG_OFF + 17*4)(sp);



/*
 * macros to set GPR arguments beyond 8
 */

#define SET_ARG8( rD )   stw rD, (ARG_OFF + 8*4)(sp);
#define SET_ARG9( rD )   stw rD, (ARG_OFF + 9*4)(sp);
#define SET_ARG10( rD )  stw rD, (ARG_OFF + 10*4)(sp);
#define SET_ARG11( rD )  stw rD, (ARG_OFF + 11*4)(sp);
#define SET_ARG12( rD )  stw rD, (ARG_OFF + 12*4)(sp);
#define SET_ARG13( rD )  stw rD, (ARG_OFF + 13*4)(sp);
#define SET_ARG14( rD )  stw rD, (ARG_OFF + 14*4)(sp);
#define SET_ARG15( rD )  stw rD, (ARG_OFF + 15*4)(sp);
#define SET_ARG16( rD )  stw rD, (ARG_OFF + 16*4)(sp);
#define SET_ARG17( rD )  stw rD, (ARG_OFF + 17*4)(sp);

/* 

* macro to branch from one entry point to another 
*/ 

#define BR_FUNC( func_name ) \
    b func_name;

/*
 * macros to call functions
 */

#define CALL_FUNC( func_name ) \
    bl func_name;

#define CALL_0( func_name ) \
    CALL_FUNC( func_name )

#define CALL_1( func_name, arg0 ) \
    CALL_FUNC( func_name )

#define CALL_2( func_name, arg0, arg1 ) \
    CALL_FUNC( func_name )

#define CALL_3( func_name, arg0, arg1, arg2 ) \
    CALL_FUNC( func_name )

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
    CALL_FUNC( func_name )

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    CALL_FUNC( func_name )

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    CALL_FUNC( func_name )

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    CALL_FUNC( func_name )

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    CALL_FUNC( func_name )

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                arg8 ) \
    CALL_FUNC( func_name )

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9 ) \
    CALL_FUNC( func_name )

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10 ) \
    CALL_FUNC( func_name )

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11 ) \
    CALL_FUNC( func_name )

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12 ) \
    CALL_FUNC( func_name )

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13 ) \
    CALL_FUNC( func_name )

#define CALL_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14 ) \
    CALL_FUNC( func_name )

#define CALL_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 ) \
    CALL_FUNC( func_name )

#if defined( BUILD_MAX )

#if defined( COMPILE_ESAL_JUMP_TABLE )

/*
 * G4 macros to create an ESAL jump table for 1, 2, 3 and 4 vector
 * algorithms. The table name is <root_name>_jump and is made a
 * local symbol, (not supported in C)
 */

#define DECLARE_VMX_V1( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _n ); \
    .long CONCAT( root_name, _c );

#define DECLARE_VMX_V2( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nn ); \
    .long CONCAT( root_name, _nc ); \
    .long CONCAT( root_name, _cn ); \
    .long CONCAT( root_name, _cc );

#define DECLARE_VMX_V3( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnn ); \
    .long CONCAT( root_name, _nnc ); \
    .long CONCAT( root_name, _ncn ); \
    .long CONCAT( root_name, _ncc ); \
    .long CONCAT( root_name, _cnn ); \
    .long CONCAT( root_name, _cnc ); \
    .long CONCAT( root_name, _ccn ); \
    .long CONCAT( root_name, _ccc );

#define DECLARE_VMX_V4( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnn ); \
    .long CONCAT( root_name, _nnnc ); \
    .long CONCAT( root_name, _nncn ); \
    .long CONCAT( root_name, _nncc ); \
    .long CONCAT( root_name, _ncnn ); \
    .long CONCAT( root_name, _ncnc ); \
    .long CONCAT( root_name, _nccn ); \
    .long CONCAT( root_name, _nccc ); \
    .long CONCAT( root_name, _cnnn ); \
    .long CONCAT( root_name, _cnnc ); \
    .long CONCAT( root_name, _cncn ); \
    .long CONCAT( root_name, _cncc ); \
    .long CONCAT( root_name, _ccnn ); \
    .long CONCAT( root_name, _ccnc ); \
    .long CONCAT( root_name, _cccn ); \
    .long CONCAT( root_name, _cccc );

#define DECLARE_VMX_V5( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnnn ); \
    .long CONCAT( root_name, _nnnnc ); \
    .long CONCAT( root_name, _nnncn ); \
    .long CONCAT( root_name, _nnncc ); \
    .long CONCAT( root_name, _nncnn ); \
    .long CONCAT( root_name, _nncnc ); \
    .long CONCAT( root_name, _nnccn ); \
    .long CONCAT( root_name, _nnccc ); \
    .long CONCAT( root_name, _ncnnn ); \
    .long CONCAT( root_name, _ncnnc ); \
    .long CONCAT( root_name, _ncncn ); \
    .long CONCAT( root_name, _ncncc ); \
    .long CONCAT( root_name, _nccnn ); \
    .long CONCAT( root_name, _nccnc ); \
    .long CONCAT( root_name, _ncccn ); \
    .long CONCAT( root_name, _ncccc ); \
    .long CONCAT( root_name, _cnnnn ); \
    .long CONCAT( root_name, _cnnnc ); \
    .long CONCAT( root_name, _cnncn ); \
    .long CONCAT( root_name, _cnncc ); \
    .long CONCAT( root_name, _cncnn ); \
    .long CONCAT( root_name, _cncnc ); \
    .long CONCAT( root_name, _cnccn ); \
    .long CONCAT( root_name, _cnccc ); \
    .long CONCAT( root_name, _ccnnn ); \
    .long CONCAT( root_name, _ccnnc ); \
    .long CONCAT( root_name, _ccncn ); \
    .long CONCAT( root_name, _ccncc ); \
    .long CONCAT( root_name, _cccnn ); \
    .long CONCAT( root_name, _cccnc ); \
    .long CONCAT( root_name, _ccccn ); \
    .long CONCAT( root_name, _ccccc );





#define DECLARE_VMX_Z1( root_name ) DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_Z2( root_name ) DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_Z3( root_name ) DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_Z4( root_name ) DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_Z5( root_name ) DECLARE_VMX_V5( root_name )



/*
 * G4 macros to branch through the <root_name>_jump table based on
 * the value of the ESAL flag, (not supported in C)
 * (uses r0 as scratch and destroys eflag)
 * (not supported in C)
 */

#define BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp ) \
    addis rtemp, 0, CONCAT( root_name, _jump@ha ); \
    addi rtemp, rtemp, CONCAT( root_name, _jump@l ); \
    lwzx rtemp, rtemp, r0; \
    mtctr rtemp; \
    bctr;



#define BR_VMX_V1( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 29, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V2( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 28, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V3( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 27, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V4( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 26, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V5( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 25, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )
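The jump-table branch can be modelled in C as follows. The `rlwinm` in each BR_VMX_Vn macro rotates the ESAL flag left by 2 and keeps n bits, which yields a byte offset of 4 times the low n bits of the flag into the table of `.long` entries. The sketch below is only an illustration under that reading: the function names are hypothetical, and the meaning of the `n` and `c` suffix letters is not stated in the listing, so the code merely reproduces the order in which the table entries are declared.

```c
#include <stdint.h>

/*
 * Hypothetical model of "rlwinm r0, eflag, 2, 30 - nvec, 29":
 * byte offset into the <root_name>_jump table for an nvec-vector
 * algorithm (nvec = 1..5).
 */
uint32_t esal_table_offset(uint32_t eflag, unsigned nvec)
{
    uint32_t index_mask = (1u << nvec) - 1u;  /* low nvec bits of the flag */
    return (eflag & index_mask) << 2;         /* 4 bytes per .long entry   */
}

/*
 * Hypothetical helper: writes the suffix letters of table entry `index`
 * into out (at least nvec + 1 bytes). The most significant of the nvec
 * index bits gives the first letter; a set bit is written as 'c',
 * a clear bit as 'n', matching the declared table order.
 */
void esal_suffix(uint32_t index, unsigned nvec, char *out)
{
    unsigned i;
    for (i = 0; i < nvec; i++) {
        unsigned bit = (index >> (nvec - 1u - i)) & 1u;
        out[i] = bit ? 'c' : 'n';
    }
    out[nvec] = '\0';
}
```

For a two-vector algorithm, a flag whose low two bits are 01 gives offset 4, which is the second table entry, the one with suffix `nc`.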





#else /* no ESAL jump table */ 

/* 

* G4 macros to create a dummy jump table. . 

* (not supported in C) 
*/ 



#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )
#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )



/* 

* G4 macros to simply branch to root_name (no jump table) 

* (not supported in C) 
*/ 

#define BR_VMX_V1( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V2( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V3( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V4( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V5( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )

#endif /* end COMPILE_ESAL_JUMP_TABLE */



/*
 * G4 macros to decide whether to enter a VMX loop
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note, a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (uses r0 as scratch)
 * (not supported in C)
 */

#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    andi. r0, p1, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, ps, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, r0, ps; \
    bne v_skip_vmx; \
    andi. r0, r0, 0x6; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, pc, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    andi. r0, pc, 1; \
    bne v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 2; \
    bne v_skip_vmx; \
    xor r0, r0, pc; \
    andi. r0, r0, 0x3; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:
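The net condition tested by the BR_IF_VMX_Vn macros and their _ALIGNED variants can be written as a pair of C predicates. This is a sketch of the decision only, not of the instruction scheduling: the macros interleave each compare with the next computation, while the model checks the conditions one after another. The function names, the use of 32-bit integers for addresses and the array-based interface are assumptions made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of BR_IF_VMX_Vn: returns 1 when the VMX loop would
 * be entered. ptr[] holds the vector addresses, stride[] their strides.
 */
int vmx_same_alignment(uint32_t n, uint32_t min_n, int32_t unit_stride,
                       const uint32_t *ptr, const int32_t *stride,
                       size_t nvec)
{
    size_t i;
    if (n < min_n)                      /* cmplwi n, min_n_imm; blt       */
        return 0;
    for (i = 0; i < nvec; i++)          /* cmpwi sK, unit_s_imm; bne      */
        if (stride[i] != unit_stride)
            return 0;
    for (i = 1; i < nvec; i++)          /* xor r0, p1, pK; andi. 0xf; bne */
        if (((ptr[0] ^ ptr[i]) & 0xfu) != 0)
            return 0;
    return 1;
}

/*
 * Hypothetical model of BR_IF_VMX_Vn_ALIGNED: as above, but every
 * pointer must itself be 16-byte aligned (the macros OR the pointers
 * together before masking with 0xf).
 */
int vmx_all_aligned(uint32_t n, uint32_t min_n, int32_t unit_stride,
                    const uint32_t *ptr, const int32_t *stride,
                    size_t nvec)
{
    size_t i;
    uint32_t low_bits = 0;
    if (n < min_n)
        return 0;
    for (i = 0; i < nvec; i++) {
        if (stride[i] != unit_stride)
            return 0;
        low_bits |= ptr[i];
    }
    return (low_bits & 0xfu) == 0;
}
```

Two vectors at addresses 0x1004 and 0x2014 share the same lower 4 bits, so the first predicate accepts them while the second rejects them.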

#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z1( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z2( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z4( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, pr5, pi5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z5( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_CONV( root_name, min_n_imm, \
                        p1, s1, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne v_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
                         pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, 1; \
    bne z_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne z_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

/*
 * G4 macro to get VMX unaligned word (FP) count
 * assumes that the last 2 bits of ptr are 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 30, 30, 31;

/*
 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 31, 29, 31;

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 0, 28, 31;
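The three count macros compute how many elements lie between ptr and the next 16-byte boundary: the address is negated, shifted right by the element-size exponent (the rotate left by 30 or 31 followed by a low-bit mask acts as a shift right by 2 or 1), and masked to the number of elements per vector minus one. A C sketch of the same arithmetic, with hypothetical function names and 32-bit addresses assumed:

```c
#include <stdint.h>

/* Words (4 bytes) before the next 16-byte boundary; ptr is 4-byte aligned. */
uint32_t vmx_unaligned_words(uint32_t ptr)
{
    return ((0u - ptr) >> 2) & 3u;
}

/* Shorts (2 bytes) before the next 16-byte boundary; ptr is 2-byte aligned. */
uint32_t vmx_unaligned_shorts(uint32_t ptr)
{
    return ((0u - ptr) >> 1) & 7u;
}

/* Chars before the next 16-byte boundary. */
uint32_t vmx_unaligned_chars(uint32_t ptr)
{
    return (0u - ptr) & 15u;
}
```

For ptr = 0x1004 the word count is 3, since three more words reach 0x1010; an already aligned pointer gives 0, which is the case the CR0 result of the `rlwinm.` lets the caller test.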

/* 

* G4 macro to load and splat an FP scalar independent of alignment 
*/ 

#if defined( LITTLE_ENDIAN )

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsr vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 3;

#else

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsl vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 0;

#endif

/*
 * G4 macro to construct an FP absolute value mask that can be used with
 * vand to take the absolute value of 4 FP numbers in a vector register
 * vt = 0x7fffffff7fffffff7fffffff7fffffff
 */

#define MAKE_VABS_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt; \
    vnor vt, vt, vt;

/*
 * G4 macro to construct an FP sign mask that can be used with:
 *     vandc to take the absolute value of
 *     vor to take the negative absolute value of
 *     vxor to negate
 * 4 FP numbers in a vector register
 * vt = 0x80000000800000008000000080000000
 */

#define MAKE_VSIGN_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt;
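The two mask macros build their constants without a memory load. Each word is first set to all ones; shifting a word left by itself uses only the low 5 bits of the shift count, that is 31, which leaves just the sign bit; a NOR of the result with itself then yields the complement. The C sketch below models one 32-bit lane. The function names are hypothetical, and a 32-bit IEEE 754 `float` is assumed.

```c
#include <stdint.h>
#include <string.h>

/* One lane of MAKE_VSIGN_MASK: expected to give 0x80000000. */
uint32_t make_vsign_mask_lane(void)
{
    uint32_t v = 0xffffffffu;   /* vspltisw vt, -1                      */
    return v << (v & 31u);      /* vslw: count is the low 5 bits, i.e. 31 */
}

/* One lane of MAKE_VABS_MASK: expected to give 0x7fffffff. */
uint32_t make_vabs_mask_lane(void)
{
    uint32_t s = make_vsign_mask_lane();
    return ~(s | s);            /* vnor vt, vt, vt                      */
}

/* Absolute value of a float by clearing its sign bit with the mask. */
float abs_via_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= make_vabs_mask_lane();
    memcpy(&x, &bits, sizeof x);
    return x;
}
```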



/*
 * G4 macros to construct a coded touch stream control register
 * "I" indicates argument is passed as an immediate value
 * "R" indicates argument is passed in an integer register
 *
 * bytes_per_block = # of bytes in each block
 *     (0 = 512, 16, 32, 480, 512)
 * block_count = # of blocks (0 = 256, 1, 2, 3, ... 256)
 * byte_stride = signed byte stride between start of adjacent blocks
 *     (-32768 <= byte_stride < 0; 0 = 32768; 0 < byte_stride < 32768)
 */

#define MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE( rB, bytes_per_block, block_count, byte_stride ) \
    MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride )

#define MAKE_STREAM_CODE_IIR( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_IRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_IRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RII( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RIR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    rlwimi rB, byte_stride, 0, 16, 31;

tfendif /* end BUILD_MAX */ 
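The all-immediate variant of the stream-code macros above builds a 32-bit control word with a lis/ori pair. A minimal C sketch of the same packing may help when checking field values by hand; the function name make_stream_code is hypothetical and not part of the listing, and the multiple-of-16 check is an assumption drawn from the ">> 4" in the macro.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: packs the three fields the way the lis/ori pair in
 * MAKE_STREAM_CODE_III does (lis fills the upper halfword, ori the lower). */
static uint32_t make_stream_code(uint32_t bytes_per_block,
                                 uint32_t block_count,
                                 int32_t byte_stride)
{
    /* Assumption: the size is given in bytes and is a multiple of 16,
     * because the macro discards the low four bits. */
    assert((bytes_per_block & 15u) == 0);

    uint32_t upper = (((bytes_per_block >> 4) & 31u) << 8) | (block_count & 255u);
    uint32_t lower = (uint32_t)byte_stride & 0x0000ffffu;
    return (upper << 16) | lower;
}
```

For example, make_stream_code(32, 4, 64) yields 0x02040040, and a block count of 256 wraps to the encoded value 0, matching the "0 = 256" note in the comment.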

#define CACHE_TB_THRESHOLD 1 /* 2 TB ticks = 12 CPU 100 MHz clks */

#define INSTRUCTION_CACHE_COUNT 3 /* min. to fully cache instructions */

#define POSTING_BUFFER_COUNT 10 /* min. to fill posting buffer */

/*
 * macros to set DCBx conditions explicitly
 */

#define DCBT_TRUE( cond_bit, scratch ) \
    li scratch, 0; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_TRUE( cond_bit, scratch ) \
    DCBT_TRUE( cond_bit, scratch )

#define DCBT_FALSE( cond_bit, scratch ) \
    li scratch, 2; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_FALSE( cond_bit, scratch ) \
    DCBT_FALSE( cond_bit, scratch )
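The explicit TRUE/FALSE macros above work by forcing a known result into a condition-register field: comparing 0 with 1 sets "less than", comparing 2 with 1 sets "greater than", and the DCBT_IF and DCBZ_IF macros defined further down skip the cache instruction only when "greater than" is set. The following is a small C model of that convention, offered as a sketch; the type cr_field and the function names are illustrative and do not appear in the listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of one condition-register field after an unsigned compare. */
typedef struct { bool lt, gt, eq; } cr_field;

/* Model of "cmplwi crf, rA, UIMM": unsigned compare against a 16-bit immediate. */
static cr_field cmplwi(uint32_t a, uint32_t uimm)
{
    assert(uimm <= 0xffffu);
    cr_field f = { a < uimm, a > uimm, a == uimm };
    return f;
}

static cr_field dcbt_true(void)  { return cmplwi(0u, 1u); } /* li 0; cmplwi ..., 1 */
static cr_field dcbt_false(void) { return cmplwi(2u, 1u); } /* li 2; cmplwi ..., 1 */

/* DCBT_IF / DCBZ_IF branch around the cache instruction when GT is set,
 * so the instruction runs when the field reads "<=". */
static bool dcbx_performed(cr_field f) { return !f.gt; }
```

Under this model dcbx_performed(dcbt_true()) is true and dcbx_performed(dcbt_false()) is false.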

/*
 * This macro will cause a file not to assemble.
 */

#define DO_NOT_ASSEMBLE add scratch1, scratch2, 256;

/*
 * Obsolete macro will cause assembler error
 */

#define TEST_IF_CACHABLE( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * Obsolete macro will cause assembler error
 */

#define TEST_IF_CACHABLE_ALIGN( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * macros to test if a DCBT or DCBZ instruction should be performed on
 * a particular buffer based on a bit test (cache bit) on a specified
 * ESAL flag.
 */

#define TEST_IF_DCBT( cond_bit, cache_bit, eflag, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
    andi. scratch1, eflag, (cache_bit); \
    cmplwi (cond_bit), scratch1, 0;

/*
 * Set 2 dcbt conditions and ensure only one is true
 *
 *   Ins. 1-3   Set both conditions to "No DCBT"
 *   Ins. 4     See if vec1 has a C
 *   Ins. 5     Set DCBT cond1
 *   Ins. 6     Branch if "DCBT TRUE" (eflag & bit1 == 0)
 *   Ins. 7-8   Set DCBT cond2
 */

#define SET_2_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0; \
    bc 12, ((cond1_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0;
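The control flow of SET_2_DCBT_COND can be restated in C to make the "only one is true" guarantee easier to see. This is a sketch under the reading that a condition is "DCBT TRUE" when the tested flag bit is clear, which is how the compare-with-zero followed by DCBT_IF behaves in this listing; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative result type: the two condition fields reduced to booleans. */
typedef struct { bool cond1, cond2; } dcbt_pair;

static dcbt_pair set_2_dcbt_cond(uint32_t eflag,
                                 uint32_t cache_bit1,
                                 uint32_t cache_bit2)
{
    /* andi. takes a 16-bit unsigned immediate mask. */
    assert(cache_bit1 <= 0xffffu && cache_bit2 <= 0xffffu);

    dcbt_pair r = { false, false };              /* ins. 1-3: both "No DCBT" */
    r.cond1 = (eflag & cache_bit1) == 0;         /* ins. 4-5 */
    if (!r.cond1)                                /* ins. 6: skip if cond1 holds */
        r.cond2 = (eflag & cache_bit2) == 0;     /* ins. 7-8 */
    return r;
}
```

Because the second test runs only when the first condition fails, at most one of the two fields ends up true; the 3- and 4-condition macros chain the same pattern from the highest-numbered condition down.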

/*
 * Set 3 dcbt conditions and ensure only one is true
 *
 * Logic is similar to the SET_2_DCBT_COND() macro
 */

#define SET_3_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

/*
 * Set 4 dcbt conditions and ensure only one is true
 *
 * Logic is similar to the SET_2_DCBT_COND() macro
 */

#define SET_4_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, cond4_bit, cache_bit4, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    cmplwi (cond4_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit4); \
    cmplwi (cond4_bit), scratch, 0; \
    bc 12, ((cond4_bit)<<2)+2, PC_OFFSET( 36 ); \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

#if !defined COMPILE_NO_DCBZ

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 104 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 92 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 84 ); \
    addi tmp2, buffer, CACHE_LINE_SIZE; \
    li tmp3, CACHE_LINE_ADDR_MASK; \
    and tmp2, tmp2, tmp3; \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, tmp2; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 100 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 88 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 80 ); \
    andi. tmp3, buffer, CACHE_LINE_MASK; \
    bne PC_OFFSET( 72 ); \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, buffer; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#else /* COMPILE_NO_DCBZ is defined */

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#endif /* COMPILE_NO_DCBZ */
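Read as a decision procedure, SET_DCBZ_COND appears to enable dcbz only when four checks all pass: the cache flag bit is clear, the stride equals the unit stride, the element count reaches a cache-line-sized minimum, and ppc_buf_is_dcbz_safe accepts the next cache-line boundary. The C sketch below restates that reading. It makes several assumptions that the listing does not confirm: CACHE_LINE_ADDR_MASK is taken to be the complement of (line size - 1), the minimum count is passed in rather than derived from CACHE_LINE_LSIZE, and the safety routine is represented by a caller-supplied predicate with two stand-in implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed shape of the safety check; the real ppc_buf_is_dcbz_safe is external. */
typedef bool (*dcbz_safe_fn)(uintptr_t addr);

static uintptr_t last_checked;  /* records the address handed to the predicate */

/* Stand-in predicates for illustration only. */
static bool always_safe(uintptr_t addr) { last_checked = addr; return true; }
static bool never_safe(uintptr_t addr)  { last_checked = addr; return false; }

static bool set_dcbz_cond(uint32_t eflag, uint32_t cache_bit,
                          uint32_t stride, uint32_t unit_stride,
                          uint32_t count, uint32_t min_count,
                          uintptr_t buffer, uintptr_t line_size,
                          dcbz_safe_fn is_safe)
{
    /* The masking below is only meaningful for a power-of-two line size. */
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    if ((eflag & cache_bit) != 0) return false;  /* flag bit set: no dcbz */
    if (stride != unit_stride)    return false;  /* non-unit stride */
    if (count < min_count)        return false;  /* too short to matter */

    /* addi/li/and sequence: advance one line, then mask to the line start. */
    uintptr_t next_line = (buffer + line_size) & ~(line_size - 1);
    return is_safe(next_line);
}
```

SET_DCBZ_ALIGN_COND follows the same outline but, instead of rounding up, requires the buffer to be line-aligned already and passes the buffer address itself to the safety routine.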
/*
 * macro to perform [or skip] a dcbt instruction based on the result
 * of a prior call to TEST_IF_DCBT (specifying the same condition bit).
 * dcbt is performed if the cond "<=" is true; otherwise dcbt is skipped.
 */

#define DCBT_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbt rA, rB;

/*
 * macro to perform [or skip] a dcbz instruction based on the result
 * of a prior call to TEST_IF_DCBZ (specifying the same condition bit).
 * dcbz is performed if the cond "<=" is true; otherwise dcbz is skipped.
 */

#if !defined COMPILE_NO_DCBZ

#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbz rA, rB;

#else

#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    nop;

#endif



/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was cachable (i.e. TB read time was <= CACHE_TB_THRESHOLD).
 */

#define BR_IF_COND_TRUE( cond_bit, label ) \
    bc 4, ((cond_bit)<<2)+1, label; /* <= */

/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was NOT cachable (i.e. TB read time was > CACHE_TB_THRESHOLD).
 */

#define BR_IF_COND_FALSE( cond_bit, label ) \
    bc 12, ((cond_bit)<<2)+1, label; /* > */
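The expression ((cond_bit)<<2)+1 in the branch macros above is the BI operand of the bc instruction: each condition-register field holds four bits (LT, GT, EQ, SO), so field number times four plus the bit index selects one bit. With BO = 12 the branch is taken when that bit is set, and with BO = 4 when it is clear. A minimal C sketch of the arithmetic follows; the enum and function name are illustrative only.

```c
#include <assert.h>

/* Bit positions inside one condition-register field. */
enum cr_bit { CR_LT = 0, CR_GT = 1, CR_EQ = 2, CR_SO = 3 };

/* Illustrative helper: computes the BI operand for a given field and bit. */
static unsigned bi_operand(unsigned cr_field, enum cr_bit bit)
{
    assert(cr_field < 8u);  /* there are eight fields, CR0..CR7 */
    return (cr_field << 2) + (unsigned)bit;
}
```

For instance, the "+1" used by BR_IF_COND_FALSE selects the GT bit of the chosen field, while the "+2" used inside the SET_n_DCBT_COND macros selects EQ.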

/*
 * ASIC macros
 */

#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, PREFETCH_CONTROL_H; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, MISCON_B_H; \
    stw scratch1, MISCON_B_L( scratch2 );

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
    addis scratch2, 0, ASIC_H; \
    lwz scratch1, MISCON_B_L( scratch2 ); \
    andi. scratch1, scratch1, PREFETCH_MASK; \
    ori scratch1, scratch1, USE_PREFETCH_CONTROL; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )

#define LOAD_MISCON_B( mode, scratch1, scratch2 )

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif
/*
 * instruction macros
 */

#define ADD( rD, rA, rB )    add rD, rA, rB;
#define ADD_C( rD, rA, rB )    add. rD, rA, rB;
#define ADDI( rD, rA, SIMM )    addi rD, rA, (SIMM);
#define ADDIC_C( rD, rA, SIMM )    addic. rD, rA, (SIMM);
#define ADDIS( rD, rA, SIMM )    addis rD, rA, (SIMM);
#define AND( rA, rS, rB )    and rA, rS, rB;
#define AND_C( rA, rS, rB )    and. rA, rS, rB;
#define ANDC( rA, rS, rB )    andc rA, rS, rB;
#define ANDC_C( rA, rS, rB )    andc. rA, rS, rB;
#define ANDI_C( rA, rS, UIMM )    andi. rA, rS, (UIMM);
#define ANDIS_C( rA, rS, UIMM )    andis. rA, rS, (UIMM);
#define BA( label )    ba label;
#define BCTR    bctr;
#define BCTRL    bctrl;
#define BEQ( label )    beq label;
#define BEQ_PLUS( label )    beq+ label;
#define BEQ_MINUS( label )    beq- label;
#define BEQ_CR( bit, label )    beq (bit), label;
#define BEQ_CR_PLUS( bit, label )    beq+ (bit), label;
#define BEQ_CR_MINUS( bit, label )    beq- (bit), label;
#define BEQLR    beqlr;
#define BEQLR_PLUS    beqlr+;
#define BEQLR_MINUS    beqlr-;
#define BEQLR_CR( bit )    beqlr (bit);
#define BEQLR_CR_PLUS( bit )    beqlr+ (bit);
#define BEQLR_CR_MINUS( bit )    beqlr- (bit);
#define BGE( label )    bge label;
#define BGE_PLUS( label )    bge+ label;
#define BGE_MINUS( label )    bge- label;
#define BGE_CR( bit, label )    bge (bit), label;
#define BGE_CR_PLUS( bit, label )    bge+ (bit), label;
#define BGE_CR_MINUS( bit, label )    bge- (bit), label;
#define BGELR    bgelr;
#define BGELR_PLUS    bgelr+;
#define BGELR_MINUS    bgelr-;
#define BGELR_CR( bit )    bgelr (bit);
#define BGELR_CR_PLUS( bit )    bgelr+ (bit);
#define BGELR_CR_MINUS( bit )    bgelr- (bit);
#define BGT( label )    bgt label;
#define BGT_PLUS( label )    bgt+ label;
#define BGT_MINUS( label )    bgt- label;
#define BGT_CR( bit, label )    bgt (bit), label;
#define BGT_CR_PLUS( bit, label )    bgt+ (bit), label;
#define BGT_CR_MINUS( bit, label )    bgt- (bit), label;
#define BGTLR    bgtlr;
#define BGTLR_PLUS    bgtlr+;
#define BGTLR_MINUS    bgtlr-;
#define BGTLR_CR( bit )    bgtlr (bit);
#define BGTLR_CR_PLUS( bit )    bgtlr+ (bit);
#define BGTLR_CR_MINUS( bit )    bgtlr- (bit);
#define BL( label )    bl label;
#define BLE( label )    ble label;
#define BLE_PLUS( label )    ble+ label;
#define BLE_MINUS( label )    ble- label;
#define BLE_CR( bit, label )    ble (bit), label;
#define BLE_CR_PLUS( bit, label )    ble+ (bit), label;
#define BLE_CR_MINUS( bit, label )    ble- (bit), label;
#define BLELR    blelr;
#define BLELR_PLUS    blelr+;
#define BLELR_MINUS    blelr-;
#define BLELR_CR( bit )    blelr (bit);
#define BLELR_CR_PLUS( bit )    blelr+ (bit);
#define BLELR_CR_MINUS( bit )    blelr- (bit);
#define BLR    blr;
#define BLRL    blrl;
#define BLT( label )    blt label;
#define BLT_PLUS( label )    blt+ label;
#define BLT_MINUS( label )    blt- label;
#define BLT_CR( bit, label )    blt (bit), label;
#define BLT_CR_PLUS( bit, label )    blt+ (bit), label;
#define BLT_CR_MINUS( bit, label )    blt- (bit), label;
#define BLTLR    bltlr;
#define BLTLR_PLUS    bltlr+;
#define BLTLR_MINUS    bltlr-;
#define BLTLR_CR( bit )    bltlr (bit);
#define BLTLR_CR_PLUS( bit )    bltlr+ (bit);
#define BLTLR_CR_MINUS( bit )    bltlr- (bit);
#define BNE( label )    bne label;
#define BNE_PLUS( label )    bne+ label;
#define BNE_MINUS( label )    bne- label;
#define BNE_CR( bit, label )    bne (bit), label;
#define BNE_CR_PLUS( bit, label )    bne+ (bit), label;
#define BNE_CR_MINUS( bit, label )    bne- (bit), label;
#define BNELR    bnelr;
#define BNELR_PLUS    bnelr+;
#define BNELR_MINUS    bnelr-;
#define BNELR_CR( bit )    bnelr (bit);
#define BNELR_CR_PLUS( bit )    bnelr+ (bit);
#define BNELR_CR_MINUS( bit )    bnelr- (bit);
#define BR( label )    b label;
#define CLRLWI( rA, rS, nbits )    clrlwi rA, rS, (nbits);
#define CLRLWI_C( rA, rS, nbits )    clrlwi. rA, rS, (nbits);
#define CLRRWI( rA, rS, nbits )    clrrwi rA, rS, (nbits);
#define CLRRWI_C( rA, rS, nbits )    clrrwi. rA, rS, (nbits);
#define CMPLW( rA, rB )    cmplw rA, rB;
#define CMPLW_CR( bit, rA, rB )    cmplw bit, rA, rB;
#define CMPLWI( rA, UIMM )    cmplwi rA, (UIMM);
#define CMPLWI_CR( bit, rA, UIMM )    cmplwi bit, rA, (UIMM);
#define CMPW( rA, rB )    cmpw rA, rB;
#define CMPW_CR( bit, rA, rB )    cmpw bit, rA, rB;
#define CMPWI( rA, SIMM )    cmpwi rA, (SIMM);
#define CMPWI_CR( bit, rA, SIMM )    cmpwi bit, rA, (SIMM);
#define DCBF( rA, rB )    dcbf rA, rB;
#define DCBI( rA, rB )    dcbi rA, rB;
#define DCBST( rA, rB )    dcbst rA, rB;
#define DCBT( rA, rB )    dcbt rA, rB;
#define DCBTST( rA, rB )    dcbtst rA, rB;
#if !defined COMPILE_NO_DCBZ
#define DCBZ( rA, rB )    dcbz rA, rB;
#else
#define DCBZ( rA, rB )    nop;
#endif
#define DECR( rD )    addi rD, rD, -1;
#define DECR_C( rD )    addic. rD, rD, -1;
#define DIVW( rD, rA, rB )    divw rD, rA, rB;
#define DIVW_C( rD, rA, rB )    divw. rD, rA, rB;
#define DIVWU( rD, rA, rB )    divwu rD, rA, rB;
#define DIVWU_C( rD, rA, rB )    divwu. rD, rA, rB;
#define EQV( rA, rS, rB )    eqv rA, rS, rB;
#define EQV_C( rA, rS, rB )    eqv. rA, rS, rB;
#define EXTLWI( rA, rS, n, b )    rlwinm rA, rS, (b), 0, (n)-1;
#define EXTLWI_C( rA, rS, n, b )    rlwinm. rA, rS, (b), 0, (n)-1;
#define EXTRWI( rA, rS, n, b )    rlwinm rA, rS, (b)+(n), 32-(n), 31;
#define EXTRWI_C( rA, rS, n, b )    rlwinm. rA, rS, (b)+(n), 32-(n), 31;
#define FABS( frD, frB )    fabs frD, frB;
#define FADD( frD, frA, frB )    fadd frD, frA, frB;
#define FADDS( frD, frA, frB )    fadds frD, frA, frB;
#define FCMPO( bit, frA, frB )    fcmpo bit, frA, frB;
#define FCMPU( bit, frA, frB )    fcmpu bit, frA, frB;
#define FCTIW( frD, frB )    fctiw frD, frB;
#define FCTIWZ( frD, frB )    fctiwz frD, frB;
#define FDIV( frD, frA, frB )    fdiv frD, frA, frB;
#define FDIVS( frD, frA, frB )    fdivs frD, frA, frB;
#define FMADD( frD, frA, frC, frB )    fmadd frD, frA, frC, frB;
#define FMADDS( frD, frA, frC, frB )    fmadds frD, frA, frC, frB;
#define FMOV( frD, frB )    FMR( frD, frB )
#define FMR( frD, frB )    fmr frD, frB;
#define FMUL( frD, frA, frB )    fmul frD, frA, frB;
#define FMULS( frD, frA, frB )    fmuls frD, frA, frB;
#define FMSUB( frD, frA, frC, frB )    fmsub frD, frA, frC, frB;
#define FMSUBS( frD, frA, frC, frB )    fmsubs frD, frA, frC, frB;
#define FNABS( frD, frB )    fnabs frD, frB;
#define FNEG( frD, frB )    fneg frD, frB;
#define FNMADD( frD, frA, frC, frB )    fnmadd frD, frA, frC, frB;
#define FNMADDS( frD, frA, frC, frB )    fnmadds frD, frA, frC, frB;
#define FNMSUB( frD, frA, frC, frB )    fnmsub frD, frA, frC, frB;
#define FNMSUBS( frD, frA, frC, frB )    fnmsubs frD, frA, frC, frB;
#define FRES( frD, frB )    fres frD, frB;
#define FRSP( frD, frB )    frsp frD, frB;
#define FRSQRTE( frD, frB )    frsqrte frD, frB;
#define FSEL( frD, frA, frC, frB )    fsel frD, frA, frC, frB;
#define FSUB( frD, frA, frB )    fsub frD, frA, frB;
#define FSUBS( frD, frA, frB )    fsubs frD, frA, frB;
#define GOTO( label )    BR( label )
#define INCR( rD )    addi rD, rD, 1;
#define INCR_C( rD )    addic. rD, rD, 1;
#define INSLWI( rA, rS, n, b )    rlwimi rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSLWI_C( rA, rS, n, b )    rlwimi. rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSRWI( rA, rS, n, b )    rlwimi rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define INSRWI_C( rA, rS, n, b )    rlwimi. rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define LA( rD, symbol, SIMM )    addis rD, 0, (symbol + (SIMM))@ha; \
                                  addi rD, rD, (symbol + (SIMM))@l;
#define LABEL( label )    label:
#define LBZ( rD, rA, d )    lbz rD, (d)(rA);
#define LBZA( rD, symbol )    addis rD, 0, (symbol)@ha; \
                              lbz rD, (symbol)@l(rD);
#define LBZU( rD, rA, d )    lbzu rD, (d)(rA);
#define LBZUX( rD, rA, rB )    lbzux rD, rA, rB;
#define LBZX( rD, rA, rB )    lbzx rD, rA, rB;
#define LFD( frD, rA, d )    lfd frD, (d)(rA);
#define LFDU( frD, rA, d )    lfdu frD, (d)(rA);
#define LFDUX( frD, rA, rB )    lfdux frD, rA, rB;
#define LFDX( frD, rA, rB )    lfdx frD, rA, rB;
#define LFS( frD, rA, d )    lfs frD, (d)(rA);
#define LFSA( frD, symbol, rT )    addis rT, 0, (symbol)@ha; \
                                   lfs frD, (symbol)@l(rT);
#define LFSU( frD, rA, d )    lfsu frD, (d)(rA);
#define LFSUX( frD, rA, rB )    lfsux frD, rA, rB;
#define LFSX( frD, rA, rB )    lfsx frD, rA, rB;
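The EXTLWI, EXTRWI, INSLWI and INSRWI macros in this part of the listing are thin wrappers over the rotate-and-mask instructions rlwinm and rlwimi. A small C model of those two instructions can be used to check the shift and mask arithmetic on a host machine. This is a sketch for illustration: the function names are not part of the listing, and bit positions follow the PowerPC convention in which bit 0 is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Ones in big-endian bit positions mb..me (bit 0 = MSB); requires mb <= me. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    assert(mb <= me && me <= 31u);
    return (0xffffffffu >> mb) & (0xffffffffu << (31u - me));
}

/* rlwinm: rotate left, then AND with the mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* rlwimi: rotate left, then insert under the mask into the old rA value. */
static uint32_t rlwimi(uint32_t ra, uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* Same operand arithmetic as the EXTRWI, EXTLWI and INSRWI macros. */
static uint32_t extrwi(uint32_t rs, unsigned n, unsigned b)
{
    return rlwinm(rs, b + n, 32u - n, 31u);
}

static uint32_t extlwi(uint32_t rs, unsigned n, unsigned b)
{
    return rlwinm(rs, b, 0u, n - 1u);
}

static uint32_t insrwi(uint32_t ra, uint32_t rs, unsigned n, unsigned b)
{
    return rlwimi(ra, rs, 32u - (b + n), b, b + n - 1u);
}
```

With rs = 0x12345678, extracting n = 8 bits starting at bit b = 8 picks out the second byte: right-justified by extrwi and left-justified by extlwi.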
#define LHA( rD, rA, d )    lha rD, (d)(rA);
#define LHAA( rD, symbol )    addis rD, 0, (symbol)@ha; \
                              lha rD, (symbol)@l(rD);
#define LHAU( rD, rA, d )    lhau rD, (d)(rA);
#define LHAUX( rD, rA, rB )    lhaux rD, rA, rB;
#define LHAX( rD, rA, rB )    lhax rD, rA, rB;
#define LHZ( rD, rA, d )    lhz rD, (d)(rA);
#define LHZA( rD, symbol )    addis rD, 0, (symbol)@ha; \
                              lhz rD, (symbol)@l(rD);
#define LHZU( rD, rA, d )    lhzu rD, (d)(rA);
#define LHZUX( rD, rA, rB )    lhzux rD, rA, rB;
#define LHZX( rD, rA, rB )    lhzx rD, rA, rB;
#define LI( rD, SIMM )    li rD, (SIMM);
#define LIS( rD, SIMM )    lis rD, (SIMM);
#define LOAD_COUNT( rD )    mtctr rD;
#define LWZ( rD, rA, d )    lwz rD, (d)(rA);
#define LWZA( rD, symbol )    addis rD, 0, (symbol)@ha; \
                              lwz rD, (symbol)@l(rD);
#define LWZU( rD, rA, d )    lwzu rD, (d)(rA);
#define LWZUX( rD, rA, rB )    lwzux rD, rA, rB;
#define LWZX( rD, rA, rB )    lwzx rD, rA, rB;
#define MCRF( crfD, crfS )    mcrf crfD, crfS;
#define MCRFS( crfD, crfS )    mcrfs crfD, crfS;
#define MFCR( rD )    mfcr rD;
#define MFCTR( rD )    mfctr rD;
#define MFLR( rD )    mflr rD;
#define MFSPR( rD, SPR )    mfspr rD, SPR;
#define MR( rA, rS )    mr rA, rS;
#define MR_C( rA, rS )    or. rA, rS, rS;
#define MOV( rA, rS )    MR( rA, rS )
#define MOV_C( rA, rS )    MR_C( rA, rS )
#define MTCR( rD )    mtcr rD;
#define MTCTR( rD )    mtctr rD;
#define MTFSFI( crfD, IMM )    mtfsfi (crfD), (IMM);
#define MTLR( rD )    mtlr rD;
#define MTSPR( SPR, rS )    mtspr SPR, rS;
#define MULLI( rD, rA, SIMM )    mulli rD, rA, (SIMM);
#define MULLW( rD, rA, rB )    mullw rD, rA, rB;
#define MULLW_C( rD, rA, rB )    mullw. rD, rA, rB;
#define NAND( rA, rS, rB )    nand rA, rS, rB;
#define NAND_C( rA, rS, rB )    nand. rA, rS, rB;
#define NEG( rD, rA )    neg rD, rA;
#define NEG_C( rD, rA )    neg. rD, rA;
#define NOP    nop;
#define NOR( rA, rS, rB )    nor rA, rS, rB;
#define NOR_C( rA, rS, rB )    nor. rA, rS, rB;
#define OR( rA, rS, rB )    or rA, rS, rB;
#define OR_C( rA, rS, rB )    or. rA, rS, rB;
#define ORC( rA, rS, rB )    orc rA, rS, rB;
#define ORC_C( rA, rS, rB )    orc. rA, rS, rB;
#define ORI( rA, rS, UIMM )    ori rA, rS, (UIMM);
#define ORIS( rA, rS, UIMM )    oris rA, rS, (UIMM);
#define RETURN    BLR
#define RLWIMI( rA, rS, SH, MB, ME )    rlwimi rA, rS, SH, MB, ME;
#define RLWIMI_C( rA, rS, SH, MB, ME )    rlwimi. rA, rS, SH, MB, ME;
#define RLWINM( rA, rS, SH, MB, ME )    rlwinm rA, rS, SH, MB, ME;
#define RLWINM_C( rA, rS, SH, MB, ME )    rlwinm. rA, rS, SH, MB, ME;
#define RLWNM( rA, rS, rB, MB, ME )    rlwnm rA, rS, rB, MB, ME;
#define RLWNM_C( rA, rS, rB, MB, ME )    rlwnm. rA, rS, rB, MB, ME;
#define ROTLW( rA, rS, rB )    rlwnm rA, rS, rB, 0, 31;
#define ROTLW_C( rA, rS, rB )    rlwnm. rA, rS, rB, 0, 31;
#define ROTLWI( rA, rS, n )    rlwinm rA, rS, (n), 0, 31;
#define ROTLWI_C( rA, rS, n )    rlwinm. rA, rS, (n), 0, 31;
#define ROTRWI( rA, rS, n )    rlwinm rA, rS, 32-(n), 0, 31;
#define ROTRWI_C( rA, rS, n )    rlwinm. rA, rS, 32-(n), 0, 31;
#define SLW( rA, rS, rB )    slw rA, rS, rB;
#define SLW_C( rA, rS, rB )    slw. rA, rS, rB;
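The symbol-addressed load macros (LA, LBZA, LHAA, LHZA, LWZA, LFSA) split a 32-bit address into @ha and @l halves. Because the low half is later sign-extended by addi or by the load displacement, the high half has to be "adjusted" upward by one whenever the low half is 0x8000 or more. The C sketch below reproduces that arithmetic so the round trip can be verified; the helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* @ha: upper 16 bits, adjusted for the sign extension of the lower half. */
static uint16_t addr_ha(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* @l: lower 16 bits, interpreted as a signed displacement. */
static int32_t addr_lo(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xffffu);
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

/* addis rD, 0, sym@ha followed by a d(rD) access with d = sym@l. */
static uint32_t rebuild(uint32_t addr)
{
    uint32_t base = (uint32_t)addr_ha(addr) << 16;
    uint32_t ea = base + (uint32_t)addr_lo(addr);  /* wraps modulo 2^32 */
    assert(ea == addr);
    return ea;
}
```

For the address 0x12348000 the high half becomes 0x1235 and the low half -32768, which sum back to the original address.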
#define SLWI( rA, rS, SH )    slwi rA, rS, (SH);
#define SLWI_C( rA, rS, SH )    slwi. rA, rS, (SH);
#define SRAW( rA, rS, rB )    sraw rA, rS, rB;
#define SRAW_C( rA, rS, rB )    sraw. rA, rS, rB;
#define SRAWI( rA, rS, SH )    srawi rA, rS, (SH);
#define SRAWI_C( rA, rS, SH )    srawi. rA, rS, (SH);
#define SRW( rA, rS, rB )    srw rA, rS, rB;
#define SRW_C( rA, rS, rB )    srw. rA, rS, rB;
#define SRWI( rA, rS, SH )    srwi rA, rS, (SH);
#define SRWI_C( rA, rS, SH )    srwi. rA, rS, (SH);
#define STB( rS, rA, d )    stb rS, (d)(rA);
#define STBU( rS, rA, d )    stbu rS, (d)(rA);
#define STBUX( rS, rA, rB )    stbux rS, rA, rB;
#define STBX( rS, rA, rB )    stbx rS, rA, rB;
#define STFD( frD, rA, d )    stfd frD, (d)(rA);
#define STFDU( frD, rA, d )    stfdu frD, (d)(rA);
#define STFDUX( frD, rA, rB )    stfdux frD, rA, rB;
#define STFDX( frD, rA, rB )    stfdx frD, rA, rB;
#define STFS( frD, rA, d )    stfs frD, (d)(rA);
#define STFSU( frD, rA, d )    stfsu frD, (d)(rA);
#define STFSUX( frD, rA, rB )    stfsux frD, rA, rB;
#define STFSX( frD, rA, rB )    stfsx frD, rA, rB;
#define STH( rS, rA, d )    sth rS, (d)(rA);
#define STHU( rS, rA, d )    sthu rS, (d)(rA);
#define STHUX( rS, rA, rB )    sthux rS, rA, rB;
#define STHX( rS, rA, rB )    sthx rS, rA, rB;
#define STW( rS, rA, d )    stw rS, (d)(rA);
#define STWU( rS, rA, d )    stwu rS, (d)(rA);
#define STWUX( rS, rA, rB )    stwux rS, rA, rB;
#define STWX( rS, rA, rB )    stwx rS, rA, rB;
#define SUB( rD, rA, rB )    sub rD, rA, rB;
#define SUB_C( rD, rA, rB )    sub. rD, rA, rB;
#define SUBFIC( rD, rA, SIMM )    subfic rD, rA, (SIMM);
#define SUBI( rD, rA, SIMM )    subi rD, rA, (SIMM);
#define SUBIC_C( rD, rA, SIMM )    subic. rD, rA, (SIMM);
#define SUBIS( rD, rA, SIMM )    subis rD, rA, (SIMM);
#define TEST_COUNT( label )    bdnz label;
#define XOR( rA, rS, rB )    xor rA, rS, rB;
#define XOR_C( rA, rS, rB )    xor. rA, rS, rB;
#define XORI( rA, rS, UIMM )    xori rA, rS, (UIMM);
#define XORIS( rA, rS, UIMM )    xoris rA, rS, (UIMM);

/*
 * VMX instructions
 */

#define BR_VMX_ALL_TRUE( label )    bt 24, label;
#define BR_VMX_ALL_FALSE( label )    bt 26, label;
#define BR_VMX_NONE_TRUE( label )    bt 26, label;
#define BR_VMX_SOME_FALSE( label )    bf 24, label;
#define BR_VMX_SOME_TRUE( label )    bf 26, label;
#define DSS( STRM )    dss STRM, 0;
#define DSS_ALL    dss 0, 1;
#define DST( rA, rB, STRM )    dst rA, rB, STRM;
#define DSTST( rA, rB, STRM )    dstst rA, rB, STRM;
#define DSTT( rA, rB, STRM )    dstt rA, rB, STRM;
#define DSTSTT( rA, rB, STRM )    dststt rA, rB, STRM;
#define LVEBX( vT, rA, rB )    lvebx vT, rA, rB;
#define LVEHX( vT, rA, rB )    lvehx vT, rA, rB;
#define LVEWX( vT, rA, rB )    lvewx vT, rA, rB;

#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB )    lvsr vT, rA, rB;
#define LVSR( vT, rA, rB )    lvsl vT, rA, rB;
#else
#define LVSL( vT, rA, rB )    lvsl vT, rA, rB;
#define LVSR( vT, rA, rB )    lvsr vT, rA, rB;
#endif
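The BR_VMX_* macros test bits 24 and 26 of the condition register, that is, the first and third bits of field CR6. The record forms of the vector compare instructions (the "_C" macros such as VCMPEQFP_C) write a two-bit summary there: bit 24 when every element compared true, bit 26 when every element compared false. The C sketch below models that summary and the five branch predicates; the type and function names are illustrative and the per-lane results are passed in as a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of the CR6 summary left by a recording vector compare. */
typedef struct { bool all_true, all_false; } cr6_summary;

static cr6_summary summarize(const bool *lane_result, size_t lanes)
{
    assert(lane_result != NULL && lanes > 0);
    cr6_summary s = { true, true };
    for (size_t i = 0; i < lanes; ++i) {
        if (lane_result[i])
            s.all_false = false;
        else
            s.all_true = false;
    }
    return s;
}

static bool br_vmx_all_true(cr6_summary s)   { return s.all_true; }    /* bt 24 */
static bool br_vmx_all_false(cr6_summary s)  { return s.all_false; }   /* bt 26 */
static bool br_vmx_none_true(cr6_summary s)  { return s.all_false; }   /* bt 26 */
static bool br_vmx_some_false(cr6_summary s) { return !s.all_true; }   /* bf 24 */
static bool br_vmx_some_true(cr6_summary s)  { return !s.all_false; }  /* bf 26 */
```

ALL_FALSE and NONE_TRUE are two names for the same test, which is why both macros expand to the same branch.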
#define LVX( vT, rA, rB )    lvx vT, rA, rB;
#define LVXL( vT, rA, rB )    lvxl vT, rA, rB;
#define STVEBX( vS, rA, rB )    stvebx vS, rA, rB;
#define STVEHX( vS, rA, rB )    stvehx vS, rA, rB;
#define STVEWX( vS, rA, rB )    stvewx vS, rA, rB;
#define STVX( vS, rA, rB )    stvx vS, rA, rB;
#define STVXL( vS, rA, rB )    stvxl vS, rA, rB;
#define VADDFP( vT, vA, vB )    vaddfp vT, vA, vB;
#define VADDSBS( vT, vA, vB )    vaddsbs vT, vA, vB;
#define VADDSHS( vT, vA, vB )    vaddshs vT, vA, vB;
#define VADDSWS( vT, vA, vB )    vaddsws vT, vA, vB;
#define VADDUBM( vT, vA, vB )    vaddubm vT, vA, vB;
#define VADDUBS( vT, vA, vB )    vaddubs vT, vA, vB;
#define VADDUHM( vT, vA, vB )    vadduhm vT, vA, vB;
#define VADDUHS( vT, vA, vB )    vadduhs vT, vA, vB;
#define VADDUWM( vT, vA, vB )    vadduwm vT, vA, vB;
#define VADDUWS( vT, vA, vB )    vadduws vT, vA, vB;
#define VAND( vT, vA, vB )    vand vT, vA, vB;
#define VANDC( vT, vA, vB )    vandc vT, vA, vB;
#define VCMPEQFP( vT, vA, vB )    vcmpeqfp vT, vA, vB;
#define VCMPEQFP_C( vT, vA, vB )    vcmpeqfp. vT, vA, vB;
#define VCMPEQUB( vT, vA, vB )    vcmpequb vT, vA, vB;
#define VCMPEQUB_C( vT, vA, vB )    vcmpequb. vT, vA, vB;
#define VCMPEQUH( vT, vA, vB )    vcmpequh vT, vA, vB;
#define VCMPEQUH_C( vT, vA, vB )    vcmpequh. vT, vA, vB;
#define VCMPEQUW( vT, vA, vB )    vcmpequw vT, vA, vB;
#define VCMPEQUW_C( vT, vA, vB )    vcmpequw. vT, vA, vB;
#define VCMPGEFP( vT, vA, vB )    vcmpgefp vT, vA, vB;
#define VCMPGEFP_C( vT, vA, vB )    vcmpgefp. vT, vA, vB;
#define VCMPGTFP( vT, vA, vB )    vcmpgtfp vT, vA, vB;
#define VCMPGTFP_C( vT, vA, vB )    vcmpgtfp. vT, vA, vB;
#define VCMPGTSB( vT, vA, vB )    vcmpgtsb vT, vA, vB;
#define VCMPGTSB_C( vT, vA, vB )    vcmpgtsb. vT, vA, vB;
#define VCMPGTSH( vT, vA, vB )    vcmpgtsh vT, vA, vB;
#define VCMPGTSH_C( vT, vA, vB )    vcmpgtsh. vT, vA, vB;
#define VCMPGTSW( vT, vA, vB )    vcmpgtsw vT, vA, vB;
#define VCMPGTSW_C( vT, vA, vB )    vcmpgtsw. vT, vA, vB;
#define VCMPGTUB( vT, vA, vB )    vcmpgtub vT, vA, vB;
#define VCMPGTUB_C( vT, vA, vB )    vcmpgtub. vT, vA, vB;
#define VCMPGTUH( vT, vA, vB )    vcmpgtuh vT, vA, vB;
#define VCMPGTUH_C( vT, vA, vB )    vcmpgtuh. vT, vA, vB;
#define VCMPGTUW( vT, vA, vB )    vcmpgtuw vT, vA, vB;
#define VCMPGTUW_C( vT, vA, vB )    vcmpgtuw. vT, vA, vB;
#define VCFSX( vT, vB, UIMM )    vcfsx vT, vB, (UIMM);
#define VCFUX( vT, vB, UIMM )    vcfux vT, vB, (UIMM);
#define VCTSXS( vT, vB, UIMM )    vctsxs vT, vB, (UIMM);
#define VCTUXS( vT, vB, UIMM )    vctuxs vT, vB, (UIMM);
#define VEXPTEFP( vT, vB )    vexptefp vT, vB;
#define VLOGEFP( vT, vB )    vlogefp vT, vB;
#define VMADDFP( vT, vA, vC, vB )    vmaddfp vT, vA, vC, vB;
#define VMAXFP( vT, vA, vB )    vmaxfp vT, vA, vB;
#define VMAXSB( vT, vA, vB )    vmaxsb vT, vA, vB;
#define VMAXSH( vT, vA, vB )    vmaxsh vT, vA, vB;
#define VMAXSW( vT, vA, vB )    vmaxsw vT, vA, vB;
#define VMAXUB( vT, vA, vB )    vmaxub vT, vA, vB;
#define VMAXUH( vT, vA, vB )    vmaxuh vT, vA, vB;
#define VMAXUW( vT, vA, vB )    vmaxuw vT, vA, vB;
#define VMHADDSHS( vD, vA, vB, vC )    vmhaddshs vD, vA, vB, vC;
#define VMHRADDSHS( vD, vA, vB, vC )    vmhraddshs vD, vA, vB, vC;
#define VMINFP( vT, vA, vB )    vminfp vT, vA, vB;
#define VMINSB( vT, vA, vB )    vminsb vT, vA, vB;
#define VMINSH( vT, vA, vB )    vminsh vT, vA, vB;
#define VMINSW( vT, vA, vB )    vminsw vT, vA, vB;
#define VMINUB( vT, vA, vB )    vminub vT, vA, vB;
#define VMINUH( vT, vA, vB )    vminuh vT, vA, vB;
#define VMINUW( vT, vA, vB )    vminuw vT, vA, vB;
#define VMLADDUHM( vD, vA, vB, vC )    vmladduhm vD, vA, vB, vC;
#define VMR( vD, vS )    vor vD, vS, vS;



#if defined( LITTLE_ENDIAN )
#define VMRGHB( vT, vA, vB )    vmrglb vT, vB, vA;
#define VMRGHH( vT, vA, vB )    vmrglh vT, vB, vA;
#define VMRGHW( vT, vA, vB )    vmrglw vT, vB, vA;
#define VMRGLB( vT, vA, vB )    vmrghb vT, vB, vA;
#define VMRGLH( vT, vA, vB )    vmrghh vT, vB, vA;
#define VMRGLW( vT, vA, vB )    vmrghw vT, vB, vA;
#else
#define VMRGHB( vT, vA, vB )    vmrghb vT, vA, vB;
#define VMRGHH( vT, vA, vB )    vmrghh vT, vA, vB;
#define VMRGHW( vT, vA, vB )    vmrghw vT, vA, vB;
#define VMRGLB( vT, vA, vB )    vmrglb vT, vA, vB;
#define VMRGLH( vT, vA, vB )    vmrglh vT, vA, vB;
#define VMRGLW( vT, vA, vB )    vmrglw vT, vA, vB;
#endif



#define VMSUMMBM( vT, vA, vB, vC )    vmsummbm vT, vA, vB, vC;
#define VMSUMSHM( vT, vA, vB, vC )    vmsumshm vT, vA, vB, vC;
#define VMSUMSHS( vT, vA, vB, vC )    vmsumshs vT, vA, vB, vC;
#define VMSUMUBM( vT, vA, vB, vC )    vmsumubm vT, vA, vB, vC;
#define VMSUMUHM( vT, vA, vB, vC )    vmsumuhm vT, vA, vB, vC;
#define VMSUMUHS( vT, vA, vB, vC )    vmsumuhs vT, vA, vB, vC;
#define VMULESB( vT, vA, vB )    vmulesb vT, vA, vB;
#define VMULESH( vT, vA, vB )    vmulesh vT, vA, vB;
#define VMULEUB( vT, vA, vB )    vmuleub vT, vA, vB;
#define VMULEUH( vT, vA, vB )    vmuleuh vT, vA, vB;
#define VMULOSB( vT, vA, vB )    vmulosb vT, vA, vB;
#define VMULOSH( vT, vA, vB )    vmulosh vT, vA, vB;
#define VMULOUB( vT, vA, vB )    vmuloub vT, vA, vB;
#define VMULOUH( vT, vA, vB )    vmulouh vT, vA, vB;
#define VNMSUBFP( vT, vA, vC, vB )    vnmsubfp vT, vA, vC, vB;
#define VNOR( vT, vA, vB )    vnor vT, vA, vB;
#define VNOT( vT, vA )    vnor vT, vA, vA;
#define VOR( vT, vA, vB )    vor vT, vA, vB;

#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )    vperm vT, vB, vA, vC;
#define VPKUHUM( vT, vA, vB )    vpkuhum vT, vB, vA;
#define VPKUHUS( vT, vA, vB )    vpkuhus vT, vB, vA;
#define VPKSHUS( vT, vA, vB )    vpkshus vT, vB, vA;
#define VPKSHSS( vT, vA, vB )    vpkshss vT, vB, vA;
#define VPKUWUM( vT, vA, vB )    vpkuwum vT, vB, vA;
#define VPKUWUS( vT, vA, vB )    vpkuwus vT, vB, vA;
#define VPKSWUS( vT, vA, vB )    vpkswus vT, vB, vA;
#define VPKSWSS( vT, vA, vB )    vpkswss vT, vB, vA;
#else
#define VPERM( vT, vA, vB, vC )    vperm vT, vA, vB, vC;
#define VPKUHUM( vT, vA, vB )    vpkuhum vT, vA, vB;
#define VPKUHUS( vT, vA, vB )    vpkuhus vT, vA, vB;
#define VPKSHUS( vT, vA, vB )    vpkshus vT, vA, vB;
#define VPKSHSS( vT, vA, vB )    vpkshss vT, vA, vB;
#define VPKUWUM( vT, vA, vB )    vpkuwum vT, vA, vB;
#define VPKUWUS( vT, vA, vB )    vpkuwus vT, vA, vB;
#define VPKSWUS( vT, vA, vB )    vpkswus vT, vA, vB;
#define VPKSWSS( vT, vA, vB )    vpkswss vT, vA, vB;
#endif



Udefine VREFP( vT, vB ) vrefp vT, vB; 

ltdefine VRFIM( vT, vB ) vrfim vT, vB; 

Udefine VRFIN( vT, vB ) vrfin vT, vB; 

fldefine VRFIP( vT, vB ) vrfip vT, vB; 

^define VRFIZ{ vT, vB ) vrfiz vT, vB; 

^define VRLB ( vT, vA, vB ) vrlb vT, vA, vB; 

^define VRLH ( vT, vA, vB ) vrlh vT, vA, vB; 
#define VRLW( vT, vA, vB ) vrlw vT, vA, vB;
#define VRSQRTEFP( vT, vB ) vrsqrtefp vT, vB;
#define VSEL( vT, vA, vB, vC ) vsel vT, vA, vB, vC;
#define VSL( vT, vA, vB ) vsl vT, vA, vB;

#if defined( LITTLE_ENDIAN )
#define VSLDOI( vT, vA, vB, UIMM ) vsldoi vT, vB, vA, (16 - (UIMM));
#else
#define VSLDOI( vT, vA, vB, UIMM ) vsldoi vT, vA, vB, (UIMM);
#endif

#define VSLB( vT, vA, vB ) vslb vT, vA, vB;
#define VSLH( vT, vA, vB ) vslh vT, vA, vB;
#define VSLO( vT, vA, vB ) vslo vT, vA, vB;
#define VSLW( vT, vA, vB ) vslw vT, vA, vB;
#define VSR( vT, vA, vB ) vsr vT, vA, vB;
#define VSRAB( vT, vA, vB ) vsrab vT, vA, vB;
#define VSRAH( vT, vA, vB ) vsrah vT, vA, vB;
#define VSRAW( vT, vA, vB ) vsraw vT, vA, vB;
#define VSRB( vT, vA, vB ) vsrb vT, vA, vB;
#define VSRH( vT, vA, vB ) vsrh vT, vA, vB;
#define VSRO( vT, vA, vB ) vsro vT, vA, vB;
#define VSRW( vT, vA, vB ) vsrw vT, vA, vB;
#define VSPLTB( vT, vB, UIMM ) vspltb vT, vB, C..._INDEX_MUNGE( UIMM );
#define VSPLTH( vT, vB, UIMM ) vsplth vT, vB, S..._INDEX_MUNGE( UIMM );
#define VSPLTW( vT, vB, UIMM ) vspltw vT, vB, L..._INDEX_MUNGE( UIMM );
#define VSPLTISB( vT, SIMM ) vspltisb vT, (SIMM);
#define VSPLTISH( vT, SIMM ) vspltish vT, (SIMM);
#define VSPLTISW( vT, SIMM ) vspltisw vT, (SIMM);
#define VSUBFP( vT, vA, vB ) vsubfp vT, vA, vB;
#define VSUBSBS( vT, vA, vB ) vsubsbs vT, vA, vB;
#define VSUBSHS( vT, vA, vB ) vsubshs vT, vA, vB;
#define VSUBSWS( vT, vA, vB ) vsubsws vT, vA, vB;
#define VSUBUBM( vT, vA, vB ) vsububm vT, vA, vB;
#define VSUBUBS( vT, vA, vB ) vsububs vT, vA, vB;
#define VSUBUHM( vT, vA, vB ) vsubuhm vT, vA, vB;
#define VSUBUHS( vT, vA, vB ) vsubuhs vT, vA, vB;
#define VSUBUWM( vT, vA, vB ) vsubuwm vT, vA, vB;
#define VSUBUWS( vT, vA, vB ) vsubuws vT, vA, vB;
#define VSUMSWS( vT, vA, vB ) vsumsws vT, vA, vB;
#define VSUM2SWS( vT, vA, vB ) vsum2sws vT, vA, vB;
#define VSUM4SBS( vT, vA, vB ) vsum4sbs vT, vA, vB;
#define VSUM4SHS( vT, vA, vB ) vsum4shs vT, vA, vB;
#define VSUM4UBS( vT, vA, vB ) vsum4ubs vT, vA, vB;

#if defined( LITTLE_ENDIAN )
#define VUPKHSB( vT, vB ) vupklsb vT, vB;
#define VUPKHSH( vT, vB ) vupklsh vT, vB;
#define VUPKLSB( vT, vB ) vupkhsb vT, vB;
#define VUPKLSH( vT, vB ) vupkhsh vT, vB;
#else
#define VUPKHSB( vT, vB ) vupkhsb vT, vB;
#define VUPKHSH( vT, vB ) vupkhsh vT, vB;
#define VUPKLSB( vT, vB ) vupklsb vT, vB;
#define VUPKLSH( vT, vB ) vupklsh vT, vB;
#endif

#define VXOR( vT, vA, vB ) vxor vT, vA, vB;

/*
 * stack and register macros
 */

#define VRSAVE_COND 7 /* recommended VR condition bit */
#undef VOLATILE_r13 /* r13 volatile or non-volatile */
#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)

#define ALIGN_STACK( nbytes ) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)
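
The rounding performed by ALIGN_STACK can be checked on the host side. The following is a minimal C sketch, not part of the listing: it copies the three macro definitions from above and wraps them in a hypothetical helper, align_stack(), purely for illustration.

```c
/* Illustrative sketch only: the macros mirror the listing; align_stack()
 * is a hypothetical helper that does not appear in the original file. */
#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)
#define ALIGN_STACK(nbytes) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)

/* Round a byte count up to the next multiple of MIN_STACK_ALIGN. */
static int align_stack(int nbytes)
{
    return ALIGN_STACK(nbytes);
}
```

Under these definitions a request of 17 bytes rounds up to 32, while a request that is already a multiple of 16 is returned unchanged.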

#define LR_SAVE_OFF 4

#define FPR_SAVE_OFF (-(32-14)*8)

#if defined( VOLATILE_r13 )
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-14)*4)
#else
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-13)*4)
#endif

#define CR_SAVE_OFF (GPR_SAVE_OFF - 4)

#if defined( BUILD_MAX )
#define VRSAVE_SAVE_OFF (CR_SAVE_OFF - 4)
#if defined( VOLATILE_r13 )
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 0)
#else
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 12)
#endif
#define VR_SAVE_OFF (ALIGNMENT_PADDING_OFF - (32-20)*16)
#define LAST_OFF VR_SAVE_OFF
#else
#define LAST_OFF CR_SAVE_OFF
#endif

#define REG_SAVE_SIZE (-LAST_OFF)
#define MAX_NARGS 18
#define ARGS_SIZE (MAX_NARGS * 4)
#define LINK_SIZE 8

#define STACK_FRAME_SIZE (REG_SAVE_SIZE + ARGS_SIZE + LINK_SIZE)
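
The save-area offsets above can be evaluated for each build configuration. The C sketch below is an illustration under stated assumptions: struct frame_layout and compute_layout() are hypothetical names that do not appear in the listing, and the two flags stand in for the BUILD_MAX and VOLATILE_r13 preprocessor symbols.

```c
/* Illustrative sketch only: recomputes the offsets defined in the listing.
 * frame_layout and compute_layout are hypothetical names. */
struct frame_layout {
    int fpr_save_off;
    int gpr_save_off;
    int cr_save_off;
    int last_off;
    int reg_save_size;
    int stack_frame_size;
};

static struct frame_layout compute_layout(int build_max, int volatile_r13)
{
    struct frame_layout l;

    l.fpr_save_off = -(32 - 14) * 8;               /* f14..f31, 8 bytes each */
    l.gpr_save_off = l.fpr_save_off
                   - (32 - (volatile_r13 ? 14 : 13)) * 4; /* 4 bytes per GPR */
    l.cr_save_off = l.gpr_save_off - 4;
    l.last_off = l.cr_save_off;

    if (build_max) {
        int vrsave_save_off = l.cr_save_off - 4;
        int padding_off = vrsave_save_off - (volatile_r13 ? 0 : 12);
        l.last_off = padding_off - (32 - 20) * 16; /* v20..v31, 16 bytes each */
    }

    l.reg_save_size = -l.last_off;
    l.stack_frame_size = l.reg_save_size + 18 * 4 + 8; /* ARGS_SIZE + LINK_SIZE */
    return l;
}
```

With BUILD_MAX defined and r13 non-volatile, this gives a register save size of 432 bytes and a stack frame size of 512 bytes, which agrees with the limits 32768 - 432 and 32768 - 512 quoted in the listing's comments.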
/*
 * macros to obtain the byte offset into the stack for the last FPR
 * and GPR registers for small temporary storage.
 * FPR_SAVE_AREA_OFFSET points to an area of 8 * (# of unsaved non-volatile
 * FPR registers).
 * GPR_SAVE_AREA_OFFSET points to an area of 4 * (# of unsaved non-volatile
 * GPR registers).
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 *
 * For MAX only:
 *
 * VR_SAVE_AREA_OFFSET points to an area of 16 * (# of unsaved non-volatile
 * VR registers).
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 */

#define FPR_SAVE_AREA_OFFSET FPR_SAVE_OFF
#define GPR_SAVE_AREA_OFFSET GPR_SAVE_OFF

#define GET_FPR_SAVE_AREA( ptr ) \
    addi ptr, sp, FPR_SAVE_AREA_OFFSET;

#define GET_GPR_SAVE_AREA( ptr ) \
    addi ptr, sp, GPR_SAVE_AREA_OFFSET;

#if defined( BUILD_MAX )
#define VR_SAVE_AREA_OFFSET VR_SAVE_OFF

#define GET_VR_SAVE_AREA( ptr ) \
    addi ptr, sp, VR_SAVE_AREA_OFFSET;
#endif

/*
 * if the function creates a stack frame with local storage,
 * LOCAL_STORAGE_OFFSET is the stack offset to the start of this
 * storage and is guaranteed to have the minimum stack alignment.
 */

#define LOCAL_STORAGE_OFFSET (LINK_SIZE + ARGS_SIZE)

/*
 * macros to create and destroy a stack frame.
 *
 * CREATE_STACK_FRAME[_X] creates a stack frame that can handle up to
 * 18 GPR register arguments and a local storage size <=
 * 32768 - 512 = 32,256 bytes.
 * CREATE_STACK_FRAME_X destroys r0.
 * For CREATE_STACK_FRAME_X, local_nbytes_reg must not be r0.
 * Both CREATE_STACK_FRAME[_X] and DESTROY_STACK_FRAME should not be
 * called before registers are saved or after they are restored.
 *
 * The stack pointer "output from" CREATE_STACK_FRAME[_X] must be
 * the same "input to" DESTROY_STACK_FRAME.
 */

#define CREATE_STACK_FRAME( local_nbytes ) \
    stwu sp, -ALIGN_STACK( STACK_FRAME_SIZE + (local_nbytes) )(sp);

#define CREATE_STACK_FRAME_X( local_nbytes_reg ) \
    addi r0, local_nbytes_reg, (STACK_FRAME_SIZE + MIN_STACK_ALIGN_SIZE); \
    andi. r0, r0, -MIN_STACK_ALIGN_MASK; \
    stwux sp, sp, r0;

#define DESTROY_STACK_FRAME \
    lwz sp, 0(sp);

/*
 * macros to allocate and free space on the user stack
 * with a fixed alignment of MIN_STACK_ALIGN.
 * nbytes must be <= (32768 - 432 = 32,336).
 * On return, sp points to a buffer of nbytes bytes.
 */

#define PUSH_STACK( nbytes ) \
    addi sp, sp, -ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define POP_STACK( nbytes ) \
    addi sp, sp, ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define ALLOCATE_STACK_SPACE( ptr, nbytes ) \
    PUSH_STACK( nbytes ) \
    mr ptr, sp;

#define FREE_STACK_SPACE( nbytes ) POP_STACK( nbytes )
/*
 * macros to create and destroy a stack buffer with a variable
 * alignment and size.
 *
 * CREATE_STACK_BUFFER[_X] creates a buffer of size nbytes and alignment
 * byte_align on the stack, returning a pointer to the buffer in the
 * GPR bufferp.
 * bufferp must be a GPR other than r0 and r1 (sp).
 * byte_align must be a power of 2 such that 2 <= byte_align <= 4096.
 * CREATE_STACK_BUFFER destroys r0.
 *
 * CREATE_STACK_BUFFER[_X] stores the original value of the stack pointer
 * below the buffer at offset 0 from the new stack pointer.
 *
 * DESTROY_STACK_BUFFER sets the stack pointer to the value stored
 * at the address pointed to by the input stack pointer.
 *
 * Both CREATE_STACK_BUFFER[_X] and DESTROY_STACK_BUFFER should not be
 * called before registers are saved or after they are restored.
 *
 * The stack pointer "output from" CREATE_STACK_BUFFER[_X] must be
 * the same "input to" DESTROY_STACK_BUFFER.
 */

#define CREATE_STACK_BUFFER( bufferp, byte_align, nbytes ) \
    addis bufferp, sp, (-(REG_SAVE_SIZE + (nbytes)) + 32768)@h; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, (-(REG_SAVE_SIZE + (nbytes)))@l; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;
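
The address arithmetic of CREATE_STACK_BUFFER can be followed step by step in C. The sketch below is an illustration under stated assumptions: create_stack_buffer() is a hypothetical function that models the registers as 32-bit integers and does not model the store of the old stack pointer performed by stwux.

```c
#include <stdint.h>

#define MIN_STACK_ALIGN 16u

/* Illustrative model of the CREATE_STACK_BUFFER address arithmetic.
 * Returns the buffer address; *new_sp receives the updated stack pointer.
 * The function name and signature are hypothetical. */
static uint32_t create_stack_buffer(uint32_t sp, uint32_t reg_save_size,
                                    uint32_t byte_align, uint32_t nbytes,
                                    uint32_t *new_sp)
{
    /* li r0, ((byte_align - 1) | MIN_STACK_ALIGN_MASK) */
    uint32_t mask = (byte_align - 1u) | (MIN_STACK_ALIGN - 1u);

    /* addis/addi: step below the register save area and the buffer;
     * andc: clear the low bits to reach the requested alignment. */
    uint32_t bufferp = (sp - (reg_save_size + nbytes)) & ~mask;

    /* sub/addic: the new stack pointer sits MIN_STACK_ALIGN bytes below
     * the buffer; stwux stores the old sp at that address (not modeled). */
    *new_sp = bufferp - MIN_STACK_ALIGN;
    return bufferp;
}
```

For example, with sp = 0x7fff0000, a register save size of 432, a 64-byte alignment and a 100-byte buffer, the model places the buffer at 0x7ffefdc0 and the new stack pointer at 0x7ffefdb0.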

#define CREATE_STACK_BUFFER_X( bufferp, byte_align, nbytes_reg ) \
    sub bufferp, sp, nbytes_reg; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, -REG_SAVE_SIZE; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;

#define DESTROY_STACK_BUFFER \
    lwz sp, 0(sp);

/*
 * macros to create and destroy the salcache buffer on the user stack.
 *
 * CREATE_STACK_SALCACHE destroys r0.
 *
 * Both CREATE_STACK_SALCACHE and DESTROY_STACK_SALCACHE should not be
 * called before registers are saved or after they are restored.
 */

#define CREATE_STACK_SALCACHE( cachep ) \
    CREATE_STACK_BUFFER( cachep, SALCACHE_ALIGN, SALCACHE_ALLOC_SIZE )

#define DESTROY_STACK_SALCACHE DESTROY_STACK_BUFFER

/*
 * macros for saving and restoring non-volatile
 * floating point registers (FPRs)
 */
#define SR_f14( opcode ) \
    opcode f14, (FPR_SAVE_OFF + 17*8)(sp);
#define SR_f14_f15( opcode ) \
    opcode f15, (FPR_SAVE_OFF + 16*8)(sp); \
    SR_f14( opcode )
#define SR_f14_f16( opcode ) \
    opcode f16, (FPR_SAVE_OFF + 15*8)(sp); \
    SR_f14_f15( opcode )
#define SR_f14_f17( opcode ) \
    opcode f17, (FPR_SAVE_OFF + 14*8)(sp); \
    SR_f14_f16( opcode )
#define SR_f14_f18( opcode ) \
    opcode f18, (FPR_SAVE_OFF + 13*8)(sp); \
    SR_f14_f17( opcode )
#define SR_f14_f19( opcode ) \
    opcode f19, (FPR_SAVE_OFF + 12*8)(sp); \
    SR_f14_f18( opcode )
#define SR_f14_f20( opcode ) \
    opcode f20, (FPR_SAVE_OFF + 11*8)(sp); \
    SR_f14_f19( opcode )
#define SR_f14_f21( opcode ) \
    opcode f21, (FPR_SAVE_OFF + 10*8)(sp); \
    SR_f14_f20( opcode )
#define SR_f14_f22( opcode ) \
    opcode f22, (FPR_SAVE_OFF + 9*8)(sp); \
    SR_f14_f21( opcode )
#define SR_f14_f23( opcode ) \
    opcode f23, (FPR_SAVE_OFF + 8*8)(sp); \
    SR_f14_f22( opcode )
#define SR_f14_f24( opcode ) \
    opcode f24, (FPR_SAVE_OFF + 7*8)(sp); \
    SR_f14_f23( opcode )
#define SR_f14_f25( opcode ) \
    opcode f25, (FPR_SAVE_OFF + 6*8)(sp); \
    SR_f14_f24( opcode )
#define SR_f14_f26( opcode ) \
    opcode f26, (FPR_SAVE_OFF + 5*8)(sp); \
    SR_f14_f25( opcode )
#define SR_f14_f27( opcode ) \
    opcode f27, (FPR_SAVE_OFF + 4*8)(sp); \
    SR_f14_f26( opcode )
#define SR_f14_f28( opcode ) \
    opcode f28, (FPR_SAVE_OFF + 3*8)(sp); \
    SR_f14_f27( opcode )
#define SR_f14_f29( opcode ) \
    opcode f29, (FPR_SAVE_OFF + 2*8)(sp); \
    SR_f14_f28( opcode )
#define SR_f14_f30( opcode ) \
    opcode f30, (FPR_SAVE_OFF + 1*8)(sp); \
    SR_f14_f29( opcode )
#define SR_f14_f31( opcode ) \
    opcode f31, (FPR_SAVE_OFF)(sp); \
    SR_f14_f30( opcode )

/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */

#if defined( VOLATILE_r13 )
#define SAVE_r13
#define SAVE_r13_r14 SR_r14( stw )
#define SAVE_r13_r15 SR_r14_r15( stw )
#define SAVE_r13_r16 SR_r14_r16( stw )
#define SAVE_r13_r17 SR_r14_r17( stw )
#define SAVE_r13_r18 SR_r14_r18( stw )
#define SAVE_r13_r19 SR_r14_r19( stw )
#define SAVE_r13_r20 SR_r14_r20( stw )
#define SAVE_r13
/* r13 is non-volatile */
#define REST_r13_r26 SR_r13_r26( lwz )
#define REST_r13_r27 SR_r13_r27( lwz )
#define REST_r13_r28 SR_r13_r28( lwz )
#define REST_r13_r29 SR_r13_r29( lwz )
#define REST_r13_r30 SR_r13_r30( lwz )
#define REST_r13_r31 SR_r13_r31( lwz )

/*
 * macros common to both GPR save and restore
 */

#define SR_r13( opcode ) \
    opcode r13, (GPR_SAVE_OFF + 18*4)(sp);
#define SR_r13_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp); \
    SR_r13( opcode )
#define SR_r13_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r13_r14( opcode )
#define SR_r13_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r13_r15( opcode )
#define SR_r13_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r13_r16( opcode )
#define SR_r13_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r13_r17( opcode )
#define SR_r13_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r13_r18( opcode )
#define SR_r13_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r13_r19( opcode )
#define SR_r13_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r13_r20( opcode )
#define SR_r13_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r13_r21( opcode )
#define SR_r13_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r13_r22( opcode )
#define SR_r13_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r13_r23( opcode )
#define SR_r13_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r13_r24( opcode )
#define SR_r13_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r13_r25( opcode )
#define SR_r13_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r13_r26( opcode )
#define SR_r13_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r13_r27( opcode )
#define SR_r13_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r13_r28( opcode )
#define SR_r13_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r13_r29( opcode )
#define SR_r13_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r13_r30( opcode )
#endif /* end VOLATILE_r13 */

/*
 * macros common to both GPR save and restore
 */



#define SR_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp);
#define SR_r14_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r14( opcode )
#define SR_r14_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r14_r15( opcode )
#define SR_r14_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r14_r16( opcode )
#define SR_r14_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r14_r17( opcode )
#define SR_r14_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r14_r18( opcode )
#define SR_r14_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r14_r19( opcode )
#define SR_r14_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r14_r20( opcode )
#define SR_r14_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r14_r21( opcode )
#define SR_r14_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r14_r22( opcode )
#define SR_r14_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r14_r23( opcode )
#define SR_r14_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r14_r24( opcode )
#define SR_r14_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r14_r25( opcode )
#define SR_r14_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r14_r26( opcode )
#define SR_r14_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r14_r27( opcode )
#define SR_r14_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r14_r28( opcode )
#define SR_r14_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r14_r29( opcode )
#define SR_r14_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r14_r30( opcode )
/*
 * macros common to both GPR save and restore
 */

#define SR_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp);
#define SR_r15_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r15( opcode )
#define SR_r15_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r15_r16( opcode )
#define SR_r15_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r15_r17( opcode )
#define SR_r15_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r15_r18( opcode )
#define SR_r15_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r15_r19( opcode )
#define SR_r15_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r15_r20( opcode )
#define SR_r15_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r15_r21( opcode )
#define SR_r15_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r15_r22( opcode )
#define SR_r15_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r15_r23( opcode )
#define SR_r15_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r15_r24( opcode )
#define SR_r15_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r15_r25( opcode )
#define SR_r15_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r15_r26( opcode )
#define SR_r15_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r15_r27( opcode )
#define SR_r15_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r15_r28( opcode )
#define SR_r15_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r15_r29( opcode )
#define SR_r15_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r15_r30( opcode )
/*
 * macros common to both GPR save and restore
 */

#define SR_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp);
#define SR_r16_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r16( opcode )
#define SR_r16_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r16_r17( opcode )
#define SR_r16_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r16_r18( opcode )
#define SR_r16_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r16_r19( opcode )
#define SR_r16_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r16_r20( opcode )
#define SR_r16_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r16_r21( opcode )
#define SR_r16_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r16_r22( opcode )
#define SR_r16_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r16_r23( opcode )
#define SR_r16_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r16_r24( opcode )
#define SR_r16_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r16_r25( opcode )
#define SR_r16_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r16_r26( opcode )
#define SR_r16_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r16_r27( opcode )
#define SR_r16_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r16_r28( opcode )
#define SR_r16_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r16_r29( opcode )
#define SR_r16_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r16_r30( opcode )

#if defined( BUILD_MAX )

/*
 * macros for saving and restoring non-volatile
 * vector registers (VRs)
 * (uses r0 as scratch register)
 */



#define SAVE_v20 SR_v20( stvx )
#define SAVE_v20_v21 SR_v20_v21( stvx )
#define SAVE_v20_v22 SR_v20_v22( stvx )
#define SAVE_v20_v23 SR_v20_v23( stvx )
#define SAVE_v20_v24 SR_v20_v24( stvx )
#define SAVE_v20_v25 SR_v20_v25( stvx )
#define SAVE_v20_v26 SR_v20_v26( stvx )
#define SAVE_v20_v27 SR_v20_v27( stvx )
#define SAVE_v20_v28 SR_v20_v28( stvx )
#define SAVE_v20_v29 SR_v20_v29( stvx )
#define SAVE_v20_v30 SR_v20_v30( stvx )
#define SAVE_v20_v31 SR_v20_v31( stvx )

#define REST_v20 SR_v20( lvx )
#define REST_v20_v21 SR_v20_v21( lvx )
#define REST_v20_v22 SR_v20_v22( lvx )
#define REST_v20_v23 SR_v20_v23( lvx )
#define REST_v20_v24 SR_v20_v24( lvx )
#define REST_v20_v25 SR_v20_v25( lvx )
#define REST_v20_v26 SR_v20_v26( lvx )
#define REST_v20_v27 SR_v20_v27( lvx )
#define REST_v20_v28 SR_v20_v28( lvx )
#define REST_v20_v29 SR_v20_v29( lvx )
#define REST_v20_v30 SR_v20_v30( lvx )
#define REST_v20_v31 SR_v20_v31( lvx )


/*
 * macros common to both VR save and restore
 * (uses r0 as scratch register)
 */

#define SR_v20( opcode ) \
    li r0, (VR_SAVE_OFF + 11*16); \
    opcode v20, sp, r0;
#define SR_v20_v21( opcode ) \
    li r0, (VR_SAVE_OFF + 10*16); \
    opcode v21, sp, r0; \
    SR_v20( opcode )
#define SR_v20_v22( opcode ) \
    li r0, (VR_SAVE_OFF + 9*16); \
    opcode v22, sp, r0; \
    SR_v20_v21( opcode )
#define SR_v20_v23( opcode ) \
    li r0, (VR_SAVE_OFF + 8*16); \
    opcode v23, sp, r0; \
    SR_v20_v22( opcode )
#define SR_v20_v24( opcode ) \
    li r0, (VR_SAVE_OFF + 7*16); \
    opcode v24, sp, r0; \
    SR_v20_v23( opcode )
#define SR_v20_v25( opcode ) \
    li r0, (VR_SAVE_OFF + 6*16); \
    opcode v25, sp, r0; \
    SR_v20_v24( opcode )
#define SR_v20_v26( opcode ) \
    li r0, (VR_SAVE_OFF + 5*16); \
    opcode v26, sp, r0; \
    SR_v20_v25( opcode )
#define SR_v20_v27( opcode ) \
    li r0, (VR_SAVE_OFF + 4*16); \
    opcode v27, sp, r0; \
    SR_v20_v26( opcode )
#define SR_v20_v28( opcode ) \
    li r0, (VR_SAVE_OFF + 3*16); \
    opcode v28, sp, r0; \
    SR_v20_v27( opcode )
#define SR_v20_v29( opcode ) \
    li r0, (VR_SAVE_OFF + 2*16); \
    opcode v29, sp, r0; \
    SR_v20_v28( opcode )
#define SR_v20_v30( opcode ) \
    li r0, (VR_SAVE_OFF + 1*16); \
    opcode v30, sp, r0; \
    SR_v20_v29( opcode )
#define SR_v20_v31( opcode ) \
    li r0, (VR_SAVE_OFF); \
    opcode v31, sp, r0; \
    SR_v20_v30( opcode )



/*
 * macros for saving, updating and restoring VRSAVE and saving and
 * restoring non-volatile vector registers (v0 - v31)
 * (destroys r0 and CR0 field of CR)
 */

#define NON_VOLATILE_VR_TEST( last_vreg ) \
    andi. r0, r0, ((-1 << (31 - (last_vreg))) & 0x0fff);

#define RECORD_v0_v15( last_vreg ) \
    oris r0, r0, ((-1 << (15 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;

#define RECORD_v16_v31( last_vreg ) \
    oris r0, r0, 0xffff; \
    ori r0, r0, ((-1 << (31 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;
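
The immediates used by NON_VOLATILE_VR_TEST and the RECORD macros are bit masks over VRSAVE, in which the most significant bit corresponds to v0 and the least significant bit to v31. The C sketch below is an illustration only: record_mask() and nonvolatile_test_mask() are hypothetical helpers that evaluate the same shift-and-mask expressions on the host, using unsigned arithmetic in place of the listing's (-1 << n).

```c
#include <stdint.h>

/* Illustrative sketch only; both function names are hypothetical. */

/* Bits OR-ed into VRSAVE by RECORD_v0_v15 / RECORD_v16_v31:
 * marks v0 through last_vreg as in use. */
static uint32_t record_mask(int last_vreg)
{
    if (last_vreg <= 15) {
        /* oris places its 16-bit immediate in the upper halfword */
        uint32_t imm = (0xffffffffu << (15 - last_vreg)) & 0xffffu;
        return imm << 16;
    }
    /* oris 0xffff marks v0..v15; ori marks v16..last_vreg */
    return 0xffff0000u | ((0xffffffffu << (31 - last_vreg)) & 0xffffu);
}

/* Mask applied by NON_VOLATILE_VR_TEST: selects v20 through last_vreg
 * (0x0fff covers the twelve non-volatile registers v20..v31). */
static uint32_t nonvolatile_test_mask(int last_vreg)
{
    return (0xffffffffu << (31 - last_vreg)) & 0x0fffu;
}
```

For last_vreg = 20 this yields a record mask of 0xfffff800 and a test mask of 0x0800, that is, the single VRSAVE bit for v20.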

#define USE_v0_v15( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v0_v15( last_vreg )

#define USE_v16_v19( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v16_v31( last_vreg )

#define FREE_v0_v19( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;



/*
 * user-callable macros
 */

#define USE_THRU_v0( cond ) USE_v0_v15( cond, 0 )
#define USE_THRU_v1( cond ) USE_v0_v15( cond, 1 )
#define USE_THRU_v2( cond ) USE_v0_v15( cond, 2 )
#define USE_THRU_v3( cond ) USE_v0_v15( cond, 3 )
#define USE_THRU_v4( cond ) USE_v0_v15( cond, 4 )
#define USE_THRU_v5( cond ) USE_v0_v15( cond, 5 )
#define USE_THRU_v6( cond ) USE_v0_v15( cond, 6 )
#define USE_THRU_v7( cond ) USE_v0_v15( cond, 7 )
#define USE_THRU_v8( cond ) USE_v0_v15( cond, 8 )
#define USE_THRU_v9( cond ) USE_v0_v15( cond, 9 )
#define USE_THRU_v10( cond ) USE_v0_v15( cond, 10 )
#define USE_THRU_v11( cond ) USE_v0_v15( cond, 11 )
#define USE_THRU_v12( cond ) USE_v0_v15( cond, 12 )
#define USE_THRU_v13( cond ) USE_v0_v15( cond, 13 )
#define USE_THRU_v14( cond ) USE_v0_v15( cond, 14 )
#define USE_THRU_v15( cond ) USE_v0_v15( cond, 15 )
#define USE_THRU_v16( cond ) USE_v16_v19( cond, 16 )
#define USE_THRU_v17( cond ) USE_v16_v19( cond, 17 )
#define USE_THRU_v18( cond ) USE_v16_v19( cond, 18 )
#define USE_THRU_v19( cond ) USE_v16_v19( cond, 19 )

#define USE_THRU_v20( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 32 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 20 ) /* v20 in use? */ \
    beq PC_OFFSET( 16 ); /* no, cond is set to greater than */ \
    SAVE_v20 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 20 ) /* indicate v0 - v20 in use */

#define USE_THRU_v21( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 40 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 21 ) /* v20 - v21 in use? */ \
    beq PC_OFFSET( 24 ); /* no, cond is set to greater than */ \
    SAVE_v20_v21 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 21 ) /* indicate v0 - v21 in use */

#define USE_THRU_v22( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 48 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 22 ) /* v20 - v22 in use? */ \
    beq PC_OFFSET( 32 ); /* no, cond is set to greater than */ \
    SAVE_v20_v22 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 22 ) /* indicate v0 - v22 in use */

#define USE_THRU_v23( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 56 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 23 ) /* v20 - v23 in use? */ \
    beq PC_OFFSET( 40 ); /* no, cond is set to greater than */ \
    SAVE_v20_v23 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 23 ) /* indicate v0 - v23 in use */

#define USE_THRU_v24( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 64 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 24 ) /* v20 - v24 in use? */ \
    beq PC_OFFSET( 48 ); /* no, cond is set to greater than */ \
    SAVE_v20_v24 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 24 ) /* indicate v0 - v24 in use */

#define USE_THRU_v25( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 72 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 25 ) /* v20 - v25 in use? */ \
    beq PC_OFFSET( 56 ); /* no, cond is set to greater than */ \
    SAVE_v20_v25 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 25 ) /* indicate v0 - v25 in use */

#define USE_THRU_v26( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 80 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 26 ) /* v20 - v26 in use? */ \
    beq PC_OFFSET( 64 ); /* no, cond is set to greater than */ \
    SAVE_v20_v26 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 26 ) /* indicate v0 - v26 in use */

#define USE_THRU_v27( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 88 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 27 ) /* v20 - v27 in use? */ \
    beq PC_OFFSET( 72 ); /* no, cond is set to greater than */ \
    SAVE_v20_v27 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 27 ) /* indicate v0 - v27 in use */

#define USE_THRU_v28( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 96 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 28 ) /* v20 - v28 in use? */ \
    beq PC_OFFSET( 80 ); /* no, cond is set to greater than */ \
    SAVE_v20_v28 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 28 ) /* indicate v0 - v28 in use */

#define USE_THRU_v29( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 104 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 29 ) /* v20 - v29 in use? */ \
    beq PC_OFFSET( 88 ); /* no, cond is set to greater than */ \
    SAVE_v20_v29 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 29 ) /* indicate v0 - v29 in use */

#define USE_THRU_v30( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 112 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 30 ) /* v20 - v30 in use? */ \
    beq PC_OFFSET( 96 ); /* no, cond is set to greater than */ \
    SAVE_v20_v30 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 30 ) /* indicate v0 - v30 in use */

#define USE_THRU_v31( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 120 ); /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 31 ) /* v20 - v31 in use? */ \
    beq PC_OFFSET( 104 ); /* no, cond is set to greater than */ \
    SAVE_v20_v31 /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff; /* cond is set to less than */ \
    mfspr r0, %VRSAVE; /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 31 ) /* indicate v0 - v31 in use */



#define FREE_THRU_v0( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v1( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v2( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v3( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v4( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v5( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v6( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v7( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v8( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v9( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v10( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v11( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v12( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v13( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v14( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v15( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v16( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v17( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v18( cond ) FREE_v0_v19( cond )
#define FREE_THRU_v19( cond ) FREE_v0_v19( cond )

#define FREE_THRU_v20( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(20); \
    bgt (cond), PC_OFFSET(12); \
    REST_v20; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v21( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(28); \
    bgt (cond), PC_OFFSET(20); \
    REST_v20_v21; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v22( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(36); \
    bgt (cond), PC_OFFSET(28); \
    REST_v20_v22; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v23( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(44); \
    bgt (cond), PC_OFFSET(36); \
    REST_v20_v23; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v24( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(52); \
    bgt (cond), PC_OFFSET(44); \
    REST_v20_v24; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v25( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(60); \
    bgt (cond), PC_OFFSET(52); \
    REST_v20_v25; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v26( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(68); \
    bgt (cond), PC_OFFSET(60); \
    REST_v20_v26; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v27( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(76); \
    bgt (cond), PC_OFFSET(68); \
    REST_v20_v27; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v28( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(84); \
    bgt (cond), PC_OFFSET(76); \
    REST_v20_v28; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v29( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(92); \
    bgt (cond), PC_OFFSET(84); \
    REST_v20_v29; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v30( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(100); \
    bgt (cond), PC_OFFSET(92); \
    REST_v20_v30; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v31( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET(108); \
    bgt (cond), PC_OFFSET(100); \
    REST_v20_v31; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#endif /* end BUILD_MAX */
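The comments in the USE_THRU and FREE_THRU macros refer to marking "v0 - vN in use" in VRSAVE and to testing whether any of v20 - v31 are in use. As a rough illustration only, the following C sketch computes such masks under the usual AltiVec convention that VRSAVE bit n (counting from the most significant bit) corresponds to vector register vn. The function names are hypothetical and do not appear in salppc.inc; how the listing itself builds the mask is not shown in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: VRSAVE mask marking v0..vN as in use.
 * Assumes the standard AltiVec layout, where the most significant
 * bit of VRSAVE stands for v0 and the least significant for v31. */
uint32_t vrsave_mask_thru(int n)
{
    assert(n >= 0 && n <= 31);
    return (uint32_t)(0xFFFFFFFFu << (31 - n));
}

/* Hypothetical helper: nonzero if any of v20..v31 is marked in use.
 * The listing treats these registers as the ones that need REST_v20_vN. */
int vrsave_uses_nonvolatile(uint32_t mask)
{
    return (mask & 0x00000FFFu) != 0;
}
```

With this layout, marking v0 - v19 touches none of the low twelve bits, which is consistent with every FREE_THRU_v0 through FREE_THRU_v19 macro mapping to the same FREE_v0_v19 body.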

/*
 * macros to save and restore the CR register
 * (uses r0 as scratch register)
 */

#define SAVE_CR \
    mfcr r0; \
    stw r0, CR_SAVE_OFF(sp);

#define REST_CR \
    lwz r0, CR_SAVE_OFF(sp); \
    mtcr r0;

/*
 * macros to save and restore the LR register
 * (uses r0 as scratch register)
 */

#define SAVE_LR \
    mflr r0; \
    stw r0, LR_SAVE_OFF(sp);

#define REST_LR \
    lwz r0, LR_SAVE_OFF(sp); \
    mtlr r0;

#endif /* end COMPILE_C */

/*
 * macros for declaring GPR, FPR and VMX registers
 */

/*
 * declare r0
 */
#define DECLARE_r0

/*
 * r3 declare set
 */
#define DECLARE_r3
#define DECLARE_r3_r4
#define DECLARE_r3_r5
#define DECLARE_r3_r6
#define DECLARE_r3_r7
#define DECLARE_r3_r8
#define DECLARE_r3_r9
#define DECLARE_r3_r10
#define DECLARE_r3_r11
#define DECLARE_r3_r12
#define DECLARE_r3_r13
#define DECLARE_r3_r14
#define DECLARE_r3_r15
#define DECLARE_r3_r16
#define DECLARE_r3_r17
#define DECLARE_r3_r18
#define DECLARE_r3_r19
#define DECLARE_r3_r20
#define DECLARE_r3_r21
#define DECLARE_r3_r22
#define DECLARE_r3_r23
#define DECLARE_r3_r24
#define DECLARE_r3_r25
#define DECLARE_r3_r26
#define DECLARE_r3_r27
#define DECLARE_r3_r28
#define DECLARE_r3_r29
#define DECLARE_r3_r30
#define DECLARE_r3_r31

/*
 * r4 declare set
 */
#define DECLARE_r4
#define DECLARE_r4_r5
#define DECLARE_r4_r6
#define DECLARE_r4_r7
#define DECLARE_r4_r8
#define DECLARE_r4_r9
#define DECLARE_r4_r10
#define DECLARE_r4_r11
#define DECLARE_r4_r12
#define DECLARE_r4_r13
#define DECLARE_r4_r14
#define DECLARE_r4_r15
#define DECLARE_r4_r16
#define DECLARE_r4_r17
#define DECLARE_r4_r18
#define DECLARE_r4_r19
#define DECLARE_r4_r20
#define DECLARE_r4_r21
#define DECLARE_r4_r22
#define DECLARE_r4_r23
#define DECLARE_r4_r24
#define DECLARE_r4_r25
#define DECLARE_r4_r26
#define DECLARE_r4_r27
#define DECLARE_r4_r28
#define DECLARE_r4_r29
#define DECLARE_r4_r30
#define DECLARE_r4_r31

/*
 * r5 declare set
 */
#define DECLARE_r5
#define DECLARE_r5_r6
#define DECLARE_r5_r7
#define DECLARE_r5_r8
#define DECLARE_r5_r9
#define DECLARE_r5_r10
#define DECLARE_r5_r11
#define DECLARE_r5_r12
#define DECLARE_r5_r13
#define DECLARE_r5_r14
#define DECLARE_r5_r15
#define DECLARE_r5_r16
#define DECLARE_r5_r17
#define DECLARE_r5_r18
#define DECLARE_r5_r19
#define DECLARE_r5_r20
#define DECLARE_r5_r21
#define DECLARE_r5_r22
#define DECLARE_r5_r23
#define DECLARE_r5_r24
#define DECLARE_r5_r25
#define DECLARE_r5_r26
#define DECLARE_r5_r27
#define DECLARE_r5_r28
#define DECLARE_r5_r29
#define DECLARE_r5_r30
#define DECLARE_r5_r31

/*
 * r6 declare set
 */
#define DECLARE_r6
#define DECLARE_r6_r7
#define DECLARE_r6_r8
#define DECLARE_r6_r9
#define DECLARE_r6_r10
#define DECLARE_r6_r11
#define DECLARE_r6_r12
#define DECLARE_r6_r13
#define DECLARE_r6_r14
#define DECLARE_r6_r15
#define DECLARE_r6_r16
#define DECLARE_r6_r17
#define DECLARE_r6_r18
#define DECLARE_r6_r19
#define DECLARE_r6_r20
#define DECLARE_r6_r21
#define DECLARE_r6_r22
#define DECLARE_r6_r23
#define DECLARE_r6_r24
#define DECLARE_r6_r25
#define DECLARE_r6_r26
#define DECLARE_r6_r27
#define DECLARE_r6_r28
#define DECLARE_r6_r29
#define DECLARE_r6_r30
#define DECLARE_r6_r31

/*
 * r7 declare set
 */
#define DECLARE_r7
#define DECLARE_r7_r8
#define DECLARE_r7_r9
#define DECLARE_r7_r10
#define DECLARE_r7_r11
#define DECLARE_r7_r12
#define DECLARE_r7_r13
#define DECLARE_r7_r14
#define DECLARE_r7_r15
#define DECLARE_r7_r16
#define DECLARE_r7_r17
#define DECLARE_r7_r18
#define DECLARE_r7_r19
#define DECLARE_r7_r20
#define DECLARE_r7_r21
#define DECLARE_r7_r22
#define DECLARE_r7_r23
#define DECLARE_r7_r24
#define DECLARE_r7_r25
#define DECLARE_r7_r26
#define DECLARE_r7_r27
#define DECLARE_r7_r28
#define DECLARE_r7_r29
#define DECLARE_r7_r30
#define DECLARE_r7_r31

/*
 * r8 declare set
 */
#define DECLARE_r8
#define DECLARE_r8_r9
#define DECLARE_r8_r10
#define DECLARE_r8_r11
#define DECLARE_r8_r12
#define DECLARE_r8_r13
#define DECLARE_r8_r14
#define DECLARE_r8_r15
#define DECLARE_r8_r16
#define DECLARE_r8_r17
#define DECLARE_r8_r18
#define DECLARE_r8_r19
#define DECLARE_r8_r20
#define DECLARE_r8_r21
#define DECLARE_r8_r22
#define DECLARE_r8_r23
#define DECLARE_r8_r24
#define DECLARE_r8_r25
#define DECLARE_r8_r26
#define DECLARE_r8_r27
#define DECLARE_r8_r28
#define DECLARE_r8_r29
#define DECLARE_r8_r30
#define DECLARE_r8_r31

/*
 * r9 declare set
 */
#define DECLARE_r9
#define DECLARE_r9_r10
#define DECLARE_r9_r11
#define DECLARE_r9_r12
#define DECLARE_r9_r13
#define DECLARE_r9_r14
#define DECLARE_r9_r15
#define DECLARE_r9_r16
#define DECLARE_r9_r17
#define DECLARE_r9_r18
#define DECLARE_r9_r19
#define DECLARE_r9_r20
#define DECLARE_r9_r21
#define DECLARE_r9_r22
#define DECLARE_r9_r23
#define DECLARE_r9_r24
#define DECLARE_r9_r25
#define DECLARE_r9_r26
#define DECLARE_r9_r27
#define DECLARE_r9_r28
#define DECLARE_r9_r29
#define DECLARE_r9_r30
#define DECLARE_r9_r31

/*
 * r10 declare set
 */
#define DECLARE_r10
#define DECLARE_r10_r11
#define DECLARE_r10_r12
#define DECLARE_r10_r13
#define DECLARE_r10_r14
#define DECLARE_r10_r15
#define DECLARE_r10_r16
#define DECLARE_r10_r17
#define DECLARE_r10_r18
#define DECLARE_r10_r19
#define DECLARE_r10_r20
#define DECLARE_r10_r21
#define DECLARE_r10_r22
#define DECLARE_r10_r23
#define DECLARE_r10_r24
#define DECLARE_r10_r25
#define DECLARE_r10_r26
#define DECLARE_r10_r27
#define DECLARE_r10_r28
#define DECLARE_r10_r29
#define DECLARE_r10_r30
#define DECLARE_r10_r31

/*
 * r11 declare set
 */
#define DECLARE_r11
#define DECLARE_r11_r12
#define DECLARE_r11_r13
#define DECLARE_r11_r14
#define DECLARE_r11_r15
#define DECLARE_r11_r16
#define DECLARE_r11_r17
#define DECLARE_r11_r18
#define DECLARE_r11_r19
#define DECLARE_r11_r20
#define DECLARE_r11_r21
#define DECLARE_r11_r22
#define DECLARE_r11_r23
#define DECLARE_r11_r24
#define DECLARE_r11_r25
#define DECLARE_r11_r26
#define DECLARE_r11_r27
#define DECLARE_r11_r28
#define DECLARE_r11_r29
#define DECLARE_r11_r30
#define DECLARE_r11_r31



/*
 * r12 declare set
 */
#define DECLARE_r12
#define DECLARE_r12_r13
#define DECLARE_r12_r14
#define DECLARE_r12_r15
#define DECLARE_r12_r16
#define DECLARE_r12_r17
#define DECLARE_r12_r18
#define DECLARE_r12_r19
#define DECLARE_r12_r20
#define DECLARE_r12_r21
#define DECLARE_r12_r22
#define DECLARE_r12_r23
#define DECLARE_r12_r24
#define DECLARE_r12_r25
#define DECLARE_r12_r26
#define DECLARE_r12_r27
#define DECLARE_r12_r28
#define DECLARE_r12_r29
#define DECLARE_r12_r30
#define DECLARE_r12_r31

/*
 * r13 declare set
 */
#define DECLARE_r13
#define DECLARE_r13_r14
#define DECLARE_r13_r15
#define DECLARE_r13_r16
#define DECLARE_r13_r17
#define DECLARE_r13_r18
#define DECLARE_r13_r19
#define DECLARE_r13_r20
#define DECLARE_r13_r21
#define DECLARE_r13_r22
#define DECLARE_r13_r23
#define DECLARE_r13_r24
#define DECLARE_r13_r25
#define DECLARE_r13_r26
#define DECLARE_r13_r27
#define DECLARE_r13_r28
#define DECLARE_r13_r29
#define DECLARE_r13_r30
#define DECLARE_r13_r31

/*
 * r14 declare set
 */
#define DECLARE_r14
#define DECLARE_r14_r15
#define DECLARE_r14_r16
#define DECLARE_r14_r17
#define DECLARE_r14_r18
#define DECLARE_r14_r19
#define DECLARE_r14_r20
#define DECLARE_r14_r21
#define DECLARE_r14_r22
#define DECLARE_r14_r23
#define DECLARE_r14_r24
#define DECLARE_r14_r25
#define DECLARE_r14_r26
#define DECLARE_r14_r27
#define DECLARE_r14_r28
#define DECLARE_r14_r29
#define DECLARE_r14_r30
#define DECLARE_r14_r31

/*
 * r15 declare set
 */
#define DECLARE_r15
#define DECLARE_r15_r16
#define DECLARE_r15_r17
#define DECLARE_r15_r18
#define DECLARE_r15_r19
#define DECLARE_r15_r20
#define DECLARE_r15_r21
#define DECLARE_r15_r22
#define DECLARE_r15_r23
#define DECLARE_r15_r24
#define DECLARE_r15_r25
#define DECLARE_r15_r26
#define DECLARE_r15_r27
#define DECLARE_r15_r28
#define DECLARE_r15_r29
#define DECLARE_r15_r30
#define DECLARE_r15_r31

/*
 * r16 declare set
 */
#define DECLARE_r16
#define DECLARE_r16_r17
#define DECLARE_r16_r18
#define DECLARE_r16_r19
#define DECLARE_r16_r20
#define DECLARE_r16_r21
#define DECLARE_r16_r22
#define DECLARE_r16_r23
#define DECLARE_r16_r24
#define DECLARE_r16_r25
#define DECLARE_r16_r26
#define DECLARE_r16_r27
#define DECLARE_r16_r28
#define DECLARE_r16_r29
#define DECLARE_r16_r30
#define DECLARE_r16_r31

/*
 * r17 declare set
 */
#define DECLARE_r17
#define DECLARE_r17_r18
#define DECLARE_r17_r19
#define DECLARE_r17_r20
#define DECLARE_r17_r21
#define DECLARE_r17_r22
#define DECLARE_r17_r23
#define DECLARE_r17_r24
#define DECLARE_r17_r25
#define DECLARE_r17_r26
#define DECLARE_r17_r27
#define DECLARE_r17_r28
#define DECLARE_r17_r29
#define DECLARE_r17_r30
#define DECLARE_r17_r31

/*
 * r18 declare set
 */
#define DECLARE_r18
#define DECLARE_r18_r19
#define DECLARE_r18_r20
#define DECLARE_r18_r21
#define DECLARE_r18_r22
#define DECLARE_r18_r23
#define DECLARE_r18_r24
#define DECLARE_r18_r25
#define DECLARE_r18_r26
#define DECLARE_r18_r27
#define DECLARE_r18_r28
#define DECLARE_r18_r29
#define DECLARE_r18_r30
#define DECLARE_r18_r31

/*
 * r19 declare set
 */
#define DECLARE_r19
#define DECLARE_r19_r20
#define DECLARE_r19_r21
#define DECLARE_r19_r22
#define DECLARE_r19_r23
#define DECLARE_r19_r24
#define DECLARE_r19_r25
#define DECLARE_r19_r26
#define DECLARE_r19_r27
#define DECLARE_r19_r28
#define DECLARE_r19_r29
#define DECLARE_r19_r30
#define DECLARE_r19_r31



/*
 * FPR single precision declare set
 */
#define DECLARE_f0
#define DECLARE_f0_f1
#define DECLARE_f0_f2
#define DECLARE_f0_f3
#define DECLARE_f0_f4
#define DECLARE_f0_f5
#define DECLARE_f0_f6
#define DECLARE_f0_f7
#define DECLARE_f0_f8
#define DECLARE_f0_f9
#define DECLARE_f0_f10
#define DECLARE_f0_f11
#define DECLARE_f0_f12
#define DECLARE_f0_f13
#define DECLARE_f0_f14
#define DECLARE_f0_f15
#define DECLARE_f0_f16
#define DECLARE_f0_f17
#define DECLARE_f0_f18
#define DECLARE_f0_f19
#define DECLARE_f0_f20
#define DECLARE_f0_f21
#define DECLARE_f0_f22
#define DECLARE_f0_f23
#define DECLARE_f0_f24
#define DECLARE_f0_f25
#define DECLARE_f0_f26
#define DECLARE_f0_f27
#define DECLARE_f0_f28
#define DECLARE_f0_f29
#define DECLARE_f0_f30
#define DECLARE_f0_f31



/*
 * FPR double precision declare set
 */
#define DECLARE_d0
#define DECLARE_d0_d1
#define DECLARE_d0_d2
#define DECLARE_d0_d3
#define DECLARE_d0_d4
#define DECLARE_d0_d5
#define DECLARE_d0_d6
#define DECLARE_d0_d7
#define DECLARE_d0_d8
#define DECLARE_d0_d9
#define DECLARE_d0_d10
#define DECLARE_d0_d11
#define DECLARE_d0_d12
#define DECLARE_d0_d13
#define DECLARE_d0_d14
#define DECLARE_d0_d15
#define DECLARE_d0_d16
#define DECLARE_d0_d17
#define DECLARE_d0_d18
#define DECLARE_d0_d19
#define DECLARE_d0_d20
#define DECLARE_d0_d21
#define DECLARE_d0_d22
#define DECLARE_d0_d23
#define DECLARE_d0_d24
#define DECLARE_d0_d25
#define DECLARE_d0_d26
#define DECLARE_d0_d27
#define DECLARE_d0_d28
#define DECLARE_d0_d29
#define DECLARE_d0_d30
#define DECLARE_d0_d31


/*
 * VMX declare set
 */
#define DECLARE_v0
#define DECLARE_v0_v1
#define DECLARE_v0_v2
#define DECLARE_v0_v3
#define DECLARE_v0_v4
#define DECLARE_v0_v5
#define DECLARE_v0_v6
#define DECLARE_v0_v7
#define DECLARE_v0_v8
#define DECLARE_v0_v9
#define DECLARE_v0_v10
#define DECLARE_v0_v11
#define DECLARE_v0_v12
#define DECLARE_v0_v13
#define DECLARE_v0_v14
#define DECLARE_v0_v15
#define DECLARE_v0_v16
#define DECLARE_v0_v17
#define DECLARE_v0_v18
#define DECLARE_v0_v19
#define DECLARE_v0_v20
#define DECLARE_v0_v21
#define DECLARE_v0_v22
#define DECLARE_v0_v23
#define DECLARE_v0_v24
#define DECLARE_v0_v25
#define DECLARE_v0_v26
#define DECLARE_v0_v27
#define DECLARE_v0_v28
#define DECLARE_v0_v29
#define DECLARE_v0_v30
#define DECLARE_v0_v31

#endif /* end SALPPC_INC */

END OF FILE salppc.inc
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MC Standard Algorithms -- PPC Macro language Version 



File Name:      SVE3_8BIT.MAC

Description:    Sum the elements of 3 signed byte vectors
                each of length N.

                sve3_8bit ( char *A, char *B, char *C, long *SUM, int N )

Restrictions:   A, B and C must all be 16-byte aligned.
                N must be a multiple of 16 and >= 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000605   fpl        Created

#include "salppc.inc"
/**
   Input parameters
**/

#define A       r3
#define B       r4
#define C       r5
#define SUM     r6
#define N       r7
#define A0p     A
#define B0p     B
#define C0p     C
#define A1p     r8
#define B1p     r9
#define C1p     r10
#define index   r11
#define zero    v0
#define one     v1
#define a0      v2
#define a1      v3
#define b0      v4
#define b1      v5
#define c0      v6
#define c1      v7
#define sum0    v8
#define sum1    v9
#define sum2    v10

FUNC_PROLOG
ENTRY_5( sve3_8bit, A, B, C, SUM, N )
    USE_THRU_v10( VRSAVE_COND )
    LI( index, 0 )
    VXOR( zero, zero, zero )
    ADDIC_C( N, N, -32 )
    LVX( a0, A0p, index )
    VSPLTISB( one, 1 )
    LVX( b0, B0p, index )
    ADDI( A1p, A0p, 16 )
    VXOR( sum0, sum0, sum0 )
    ADDI( B1p, B0p, 16 )
    VXOR( sum1, sum1, sum1 )
    ADDI( C1p, C0p, 16 )
    VXOR( sum2, sum2, sum2 )
    BLT( do16 )

LABEL( loop )
    ADDIC_C( N, N, -32 )
    LVX( c0, C0p, index )
    VMSUMMBM( sum0, a0, one, sum0 )
    LVX( a1, A1p, index )
    VMSUMMBM( sum1, b0, one, sum1 )
    LVX( b1, B1p, index )
    VMSUMMBM( sum2, c0, one, sum2 )
    LVX( c1, C1p, index )
    ADDI( index, index, 32 )
    VMSUMMBM( sum0, a1, one, sum0 )
    LVX( a0, A0p, index )
    VMSUMMBM( sum1, b1, one, sum1 )
    LVX( b0, B0p, index )
    VMSUMMBM( sum2, c1, one, sum2 )
    BGE( loop )

    CMPWI( N, -32 )
    BEQ( combine )

LABEL( do16 )
    LVX( c0, C0p, index )
    VMSUMMBM( sum0, a0, one, sum0 )
    VMSUMMBM( sum1, b0, one, sum1 )
    VMSUMMBM( sum2, c0, one, sum2 )

LABEL( combine )
    VADDUWM( sum0, sum0, sum1 )
    VADDUWM( sum0, sum0, sum2 )
    VSUMSWS( sum0, sum0, zero )
    VSPLTW( sum0, sum0, 3 )
    STVEWX( sum0, 0, SUM )

    FREE_THRU_v10( VRSAVE_COND )
    RETURN

FUNC_EPILOG
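
For comparison with the vectorized sve3_8bit listing, a minimal scalar sketch in C of the same computation (the sum of all elements of three signed byte vectors of length N) might look as follows. The function name sve3_8bit_ref and its return-by-value interface are assumptions made for illustration; the listing itself stores a 32-bit result through the SUM pointer.

```c
#include <assert.h>

/* Hypothetical scalar reference for the sve3_8bit routine.
 * Adds every element of a, b and c and returns the total.
 * The precondition mirrors the stated restriction on N. */
long sve3_8bit_ref(const signed char *a, const signed char *b,
                   const signed char *c, int n)
{
    assert(n >= 16 && n % 16 == 0);

    long sum = 0;
    for (int i = 0; i < n; i++) {
        /* Each byte is sign-extended before it is accumulated. */
        sum += a[i] + b[i] + c[i];
    }
    return sum;
}
```

The vector version reaches the same total by multiplying each byte by one and accumulating with VMSUMMBM into three partial-sum registers, which are then combined and reduced across the vector.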
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--** Majority Voter /Sync Control logic TOP LEVEL Module: voter_sync . vhd 
--**
--** Description: This Module is the top level of the
--**              Majority Voter and Raceway Sync Logic
--**
--** Author : Steven Imperiali
--** Date   : 7-05-2000
--** Date   : 10-25-2000 Modified cable clock and sync
--**
--*************************************************************
-- This PLD handles the following functions:
--   1) Raceway clock source and skew control
--   2) Raceway sync generation
--   3) Majority voter logic
--   4) I2C reset logic
--   5) Inverter for the HS LED signal

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE STD.TEXTIO.ALL;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

ENTITY voter_sync IS
PORT (
    clk_66_pal6       : IN    std_logic;
    clk_33_pall       : IN    std_logic;
    reset_0           : IN    std_logic;
    x_rst_brd_0       : OUT   std_logic;
    x_rst_brd_1       : OUT   std_logic;
    pll_rng_sel       : OUT   std_logic;
    pll_freq_sel      : OUT   std_logic;
    fb_sk_sel         : OUT   std_logic;
    fb_dev_by_2_0     : OUT   std_logic;
    main_sk_sel0      : OUT   std_logic;
    main_sk_sel1      : OUT   std_logic;
    jk_sk_sel0        : OUT   std_logic;
    jk_sk_sel1        : OUT   std_logic;
    jx1_clk_oe        : OUT   std_logic;
    jx2_clk_oe        : OUT   std_logic;
    sw_clk_mode2_1    : IN    std_logic_vector (2 downto 1);
    mux_clk_sel0      : OUT   std_logic;
    mux_clk_sel1      : OUT   std_logic;

    testn             : IN    std_logic;
    tms0              : IN    std_logic;
    rsync_x_nd0       : OUT   std_logic;
    rsync_x_nd1       : OUT   std_logic;
    rsync_x_nd2       : OUT   std_logic;
    rsync_x_nd3       : OUT   std_logic;
    rsync_x_pxb0      : OUT   std_logic;
    rsync_x_xbar      : OUT   std_logic;

    nd0_resetreq_0    : IN    std_logic;
    nd1_resetreq_0    : IN    std_logic;
    nd2_resetreq_0    : IN    std_logic;
    nd3_resetreq_0    : IN    std_logic;
    pq_resetreq_0     : IN    std_logic;
    resetvote_0       : OUT   std_logic;

    nd0_ckstpreqnd0_0 : IN    std_logic;
    nd0_ckstpreqnd1_0 : IN    std_logic;
    nd0_ckstpreqnd2_0 : IN    std_logic;
    nd0_ckstpreqnd3_0 : IN    std_logic;
    nd0_ckstpreqpq_0  : IN    std_logic;
    nd1_ckstpreqnd0_0 : IN    std_logic;
    nd1_ckstpreqnd1_0 : IN    std_logic;
    nd1_ckstpreqnd2_0 : IN    std_logic;
    nd1_ckstpreqnd3_0 : IN    std_logic;
    nd1_ckstpreqpq_0  : IN    std_logic;
    nd2_ckstpreqnd0_0 : IN    std_logic;
    nd2_ckstpreqnd1_0 : IN    std_logic;
    nd2_ckstpreqnd2_0 : IN    std_logic;
    nd2_ckstpreqnd3_0 : IN    std_logic;
    nd2_ckstpreqpq_0  : IN    std_logic;
    nd3_ckstpreqnd0_0 : IN    std_logic;
    nd3_ckstpreqnd1_0 : IN    std_logic;
    nd3_ckstpreqnd2_0 : IN    std_logic;
    nd3_ckstpreqnd3_0 : IN    std_logic;
    nd3_ckstpreqpq_0  : IN    std_logic;
    pq_ckstpreqnd0_0  : IN    std_logic;
    pq_ckstpreqnd1_0  : IN    std_logic;
    pq_ckstpreqnd2_0  : IN    std_logic;
    pq_ckstpreqnd3_0  : IN    std_logic;
    pq_ckstpreqpq_0   : IN    std_logic;
    pq_ckstopin_0     : OUT   std_logic;
    nd0_ckstopin_0    : OUT   std_logic;
    nd1_ckstopin_0    : OUT   std_logic;
    nd2_ckstopin_0    : OUT   std_logic;
    nd3_ckstopin_0    : OUT   std_logic;

    i2c_rst_0         : IN    std_logic;
    sda               : INOUT std_logic;
    scl               : INOUT std_logic;
    pxb0_hs_led       : IN    std_logic;
    hs_led            : OUT   std_logic
    );

END voter_sync;

ARCHITECTURE TOP_LEVEL_voter_sync OF voter_sync IS

--*************************************************************
--** Component Declaration
--*************************************************************

COMPONENT m_voter PORT(
    clk_66_pal6 : IN  std_logic;
    reset_0     : IN  std_logic;
    request0_0  : IN  std_logic;
    request1_0  : IN  std_logic;
    request2_0  : IN  std_logic;
    request3_0  : IN  std_logic;
    request4_0  : IN  std_logic;
    healthy0_1  : IN  std_logic;
    healthy1_1  : IN  std_logic;
    healthy2_1  : IN  std_logic;
    healthy3_1  : IN  std_logic;
    healthy4_1  : IN  std_logic;
    voteout_0   : OUT std_logic);
END COMPONENT;

--** Signals to Connect All of the Components Together

Signal healthy0_1       : std_logic;
Signal healthy1_1       : std_logic;
Signal healthy2_1       : std_logic;
Signal healthy3_1       : std_logic;
Signal healthy4_1       : std_logic;
Signal sync_d1          : std_logic;
Signal sync_d2          : std_logic;
Signal sync_d3          : std_logic;
Signal nd0_ckstop_0, nd1_ckstop_0, nd2_ckstop_0, nd3_ckstop_0, pq_ckstop_0 : std_logic;
Signal g_nd0_resetreq_0 : std_logic;
Signal g_nd1_resetreq_0 : std_logic;
Signal g_nd2_resetreq_0 : std_logic;
Signal g_nd3_resetreq_0 : std_logic;

BEGIN

--** Begin Architecture Here (Instantiations)

nd0_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd0_0,
    nd1_ckstpreqnd0_0,
    nd2_ckstpreqnd0_0,
    nd3_ckstpreqnd0_0,
    pq_ckstpreqnd0_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd0_ckstop_0);

nd1_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd1_0,
    nd1_ckstpreqnd1_0,
    nd2_ckstpreqnd1_0,
    nd3_ckstpreqnd1_0,
    pq_ckstpreqnd1_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd1_ckstop_0);

nd2_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd2_0,
    nd1_ckstpreqnd2_0,
    nd2_ckstpreqnd2_0,
    nd3_ckstpreqnd2_0,
    pq_ckstpreqnd2_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd2_ckstop_0);

nd3_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd3_0,
    nd1_ckstpreqnd3_0,
    nd2_ckstpreqnd3_0,
    nd3_ckstpreqnd3_0,
    pq_ckstpreqnd3_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd3_ckstop_0);

pq_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqpq_0,
    nd1_ckstpreqpq_0,
    nd2_ckstpreqpq_0,
    nd3_ckstpreqpq_0,
    pq_ckstpreqpq_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    pq_ckstop_0);

-- this section was added to force a board level reset when
-- the 8240 has a watchdog failure.
-- this should have been done by feeding the 8240's WDFAIL
-- to the reset PLD instead of forcing the 8240's resetreq
-- to drive all other resetrequests.

g_nd0_resetreq_0 <= nd0_resetreq_0 AND pq_resetreq_0;
g_nd1_resetreq_0 <= nd1_resetreq_0 AND pq_resetreq_0;
g_nd2_resetreq_0 <= nd2_resetreq_0 AND pq_resetreq_0;
g_nd3_resetreq_0 <= nd3_resetreq_0 AND pq_resetreq_0;

reset_req_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    g_nd0_resetreq_0,
    g_nd1_resetreq_0,
    g_nd2_resetreq_0,
    g_nd3_resetreq_0,
    pq_resetreq_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    resetvote_0);

healthy0_1 <= nd0_ckstop_0;
healthy1_1 <= nd1_ckstop_0;
healthy2_1 <= nd2_ckstop_0;
healthy3_1 <= nd3_ckstop_0;
healthy4_1 <= pq_ckstop_0;

nd0_ckstopin_0 <= nd0_ckstop_0;
nd1_ckstopin_0 <= nd1_ckstop_0;
nd2_ckstopin_0 <= nd2_ckstop_0;
nd3_ckstopin_0 <= nd3_ckstop_0;
pq_ckstopin_0  <= pq_ckstop_0;

WITH i2c_rst_0 SELECT
    sda <= clk_33_pall WHEN '0',
           'Z'         WHEN '1',
           'Z'         WHEN OTHERS;

WITH i2c_rst_0 SELECT
    scl <= clk_33_pall WHEN '0',
           'Z'         WHEN '1',
           'Z'         WHEN OTHERS;

hs_led <= NOT(pxb0_hs_led);



-- Sync Control 

process (clk_66_pal6 , reset_0) 

BEGIN 

IF (reset 0 = '0') THEN 
sync dl 
sync d2 
sync d3 
rsync x ndO 
rsync x ndl 
rsync x nd2 
rsync x nd3 
rsync x pxbO 
rsync_x_xbar 



<= 

< = 

< = 

< = 

< = 



ELS IF (testn 



• 0' AND reset 0 



rsync x ndO 
rsync x ndl 
rsync x nd2 
rsync x nd3 
rsync x pxbO 
rsync_x_xbar 



<= tmsO; 
<= • tmsO; 
<= tmsO; 
<= tmsO; 
<= '0'; 
<= '0'; 



' 1') THEN 



ELS IF rising edge (elk 66 pal6) THEN 
sync dl <= NOT(sync dl) ; 

sync_d2 <= (NOT (sync_d2 ) AND sync_dl OR sync_d2 AND 
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NOT (sync dl)) 



END IF; 
END process; 



rsync 


X 


ndO 


< = 


sync 


d3 


rsync 


X 


ndl 


< = 


sync 


d3 


rsync 


X 


nd2 


< = 


sync 


d3 


rsync 


X 


nd3 


< = 


sync 


d3 


rsync 


X 


pxbO 




sync 


d3 


rsync_ 


X 


xbar 


< = 


sync_ 


d3 



AND sync_d2 ) ) ; 



x rst brd 0 <= reset 0; 
x_rst_brd_l <= NOT(reset_0) ; 



WITH sw elk mode2 1 SELECT 
mux elk selO <= '0' 



WHEN "00", -- 66MHz local 

'0' WHEN "01", 33MHz cable 1 

WHEN "10", -- 33MHz cable 2 

'0' WHEN "11", -- 66 MHz local 

■1' WHEN OTHERS; 



WITH sw elk mode2 1 SELECT 
mux elk sell <= 



fb_dev_by_2_0 



WITH sw clk_mode2 1 SELECT 
jxl_clk_oe <= 



jx2_clk_oe 



pll_rng_sel 



pll_f req_sel 



* 0 1 




WHEN 


"00", 








• 1 ' 


WHEN 


"01" 

V J. , 




' 1 ' 


WHEN 


"10", 








»0' 


WHEN 


"11", 




•1 ' 


WHEN OTHERS; 




SELECT 










•0' 




WHEN 


"00", 








'Z' 


WHEN 


"01", 




•Z' 


WHEN 


"10", 








'0' 


WHEN 


"11", 




'1 ' 


WHEN OTHERS; 




SELECT 










•1' 




WHEN 


"00", 






■1» 


WHEN 


"Ol", 




•1' 




WHEN 


"10", 






■ 1 » 


WHEN 


"11", 








WHEN 


OTHERS; 




SELECT 










1 1 ' 




WHEN 


"00" , 






»1 ■ 


WHEN 


"01", 




'1' 




WHEN 


"10", 






■1' 


WHEN 


"11", 




'1' 




WHEN 


OTHERS; 




SELECT 










'1' 




WHEN 


"00", 








■ 1 ' 


WHEN 


"01", 




•1' 


WHEN 


"10", 








• 1 1 


WHEN 


"11", 




'1' 


WHEN OTHERS; 




SELECT 














WHEN 


"00", 








'0' 


WHEN 


"01" , 




'0' 


WHEN 


"10", 








•Z' 


WHEN 


"11", 
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WHEN OTHERS; 



-- select 0 skew for all modes

WITH sw_clk_mode2_1 SELECT
    fb_sk_sel <= 'Z' WHEN "00",
                 'Z' WHEN "01",
                 'Z' WHEN "10",
                 'Z' WHEN "11",
                 '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    main_sk_sel0 <= 'Z' WHEN "00",
                    'Z' WHEN "01",
                    'Z' WHEN "10",
                    'Z' WHEN "11",
                    '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    main_sk_sel1 <= 'Z' WHEN "00",
                    'Z' WHEN "01",
                    'Z' WHEN "10",
                    'Z' WHEN "11",
                    '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    jk_sk_sel0 <= 'Z' WHEN "00",
                  'Z' WHEN "01",
                  'Z' WHEN "10",
                  'Z' WHEN "11",
                  '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    jk_sk_sel1 <= 'Z' WHEN "00",
                  'Z' WHEN "01",
                  'Z' WHEN "10",
                  'Z' WHEN "11",
                  '1' WHEN OTHERS;
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END TOP_LEVEL_voter_sync; 
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MC Standard Algorithms -- PPC Macro language Version 



File Name:      ZDOTPR4_VMX.K

Description:    CPP Source code for Vector Single Precision
                Split Complex Dot Product given that input
                vectors are relatively unaligned.

Entry/params:   ZDOTPR4_VMX (A, I, B, J, C, N)
                ZIDOTPR4_VMX (A, I, B, J, C, N)

Formula:        C[0] = sum (A->realp[mI]*B->realp[mJ]
                            -/+ A->imagp[mI]*B->imagp[mJ])
                C[1] = sum (A->realp[mI]*B->imagp[mJ]
                            +/- A->imagp[mI]*B->realp[mJ])
                for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000608   fpl        Created (from zdotpr_vmx.k)
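
The formula in the header above can be written out as a short scalar sketch in C. This is only an illustration under stated assumptions: the split_complex type, the function name zdotpr_ref and the conj flag are hypothetical, and the upper sign of each -/+ and +/- pair is assumed to belong to the first entry point (ZDOTPR4_VMX) and the lower sign to the conjugating entry point (ZIDOTPR4_VMX).

```c
/* Hypothetical split-complex vector: separate real and imaginary arrays. */
typedef struct {
    const float *realp;
    const float *imagp;
} split_complex;

/* Hypothetical scalar reference for the split complex dot product.
 * i and j are the element strides of a and b, n is the element count.
 * conj == 0 is assumed to select the upper signs in the header formula,
 * conj != 0 the lower signs (A conjugated before multiplying). */
void zdotpr_ref(const split_complex *a, int i,
                const split_complex *b, int j,
                float c[2], int n, int conj)
{
    float sum_re = 0.0f;
    float sum_im = 0.0f;

    for (int m = 0; m < n; m++) {
        float ar = a->realp[m * i], ai = a->imagp[m * i];
        float br = b->realp[m * j], bi = b->imagp[m * j];

        if (conj) {
            sum_re += ar * br + ai * bi;
            sum_im += ar * bi - ai * br;
        } else {
            sum_re += ar * br - ai * bi;
            sum_im += ar * bi + ai * br;
        }
    }
    c[0] = sum_re;
    c[1] = sum_im;
}
```

The vectorized routine is expected to produce the same two sums while processing several elements per iteration; accumulation order differs, so floating-point results may not match bit for bit on general data.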

#include "salppc.inc"
/** 

ESAL CPP definitions 
**/ 

tfundef FUNC ENTRY 
#undef LOAD A 
#undef LOAD B 
#undef SUFFIX 

#if defined ( VMX_SAL ) 

#define FUNC ENTRY zdotpr4 vmx 

tfdefine FUNC CONJ ENTRY _zidotpr4 vmx 

fldefine LOAD A( vT, rA # rB ) LVX ( vT, rA, rB ) 

#define LOAD B( vT, rA, rB ) LVX( vT, rA, rB ) 

#define SUFFIX ( label ) label 



tfelif defined ( VMX_NN ) 

#define FUNC ENTRY zdotpr4 vmx nn 

#define FUNC CONJ ENTRY _zidotpr4_vmx_nn 
tfdefine LOAD A( vT, rA, rB ) LVXL { vT, rA, rB ) 
#define LOAD B( vT, rA, rB ) LVXL { vT, rA, rB ) 
tfdefine SUFFIX ( label ) label##_nn 



#elif defined ( VMX_NC ) 

#define FUNC ENTRY zdotpr4 vmx nc 

^define FUNC CONJ ENTRY _zidotpr4_vmx_nc 
#define LOAD A( vT, rA, rB ) LVXL( vT, rA, rB ) 
#define LOAD B( vT, rA, rB ) LVX ( vT, rA, rB ) 
#define SUFFIX ( label ) label## nc 



#elif defined ( VMX CN ) 



#define FUNC ENTRY 

#define FUNC CONJ ENTRY 

#define LOAD A( vT, rA, rB ) 

#define LOAD B( vT, rA, rB ) 

tfdefine SUFFIX ( label ) 



zdotpr4 vmx cn 
_zidotpr4 vmx cn 

LVX( vT, rA, rB ) 
LVXL{ vT, rA, rB ) 
label## cn 



nelif defined ( VMX CC ) 
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#define FUNC ENTRY 

#define FUNC CON J ENTRY 

tfdefine LOAD A( vT, rA, 

#define LOAD B( vT, rA, 

ttdefine SUFFIX ( label ) 



zdotpr4 vmx cc 
_zidotpr4 vmx cc 
rB ) LVX( vT, rA, rB ) 
rB ) LVX( vT, rA, rB ) 
labeltf# cc 



#else 

terror YOU MUST DEFINE VMX_xxx, where x = C or N 
#endif 



#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */
/**
  Local CPP definitions
**/
#define NMASK2 0x8
#define NMASK1 0x4
#define NSHIFT 4
#define ADDRESS_INCREMENT 16

/**
  Input args
**/
#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
  Split complex parameters
**/
#define Ar0 A
#define Ai0 r10
#define Br0 B
#define Bi0 r11
#define Cr C
#define Ci r12

/**
  Local registers
**/
#define count r4
#define rtmp0 r4
#define rtmp1 r13

#define Ar1 r13
#define Ai1 r14
#define Ar2 r15
#define Ai2 r16
#define Ar3 r17
#define Ai3 r18

#define Br1 r19
#define Bi1 r20
#define Br2 r21
#define Bi2 r22
#define Br3 r23
#define Bi3 r24
#define aoffset r25
#define coffset r25
#define boffset r26
#define addr_incr r27
/**
  VMX registers
**/
#define rsumr v0
#define rsumi v1
#define isumr v2
#define isumi v3
#define rsum0 v4
#define rsum1 v5
#define isum0 v6
#define isum1 v7

#define ar0 v4
#define ai0 v5
#define ar1 v6
#define ai1 v7
#define ar2 v8
#define ai2 v9
#define ar3 v10
#define ai3 v11

#define br0 v12
#define bi0 v13
#define br1 v14
#define bi1 v15
#define br2 v16
#define bi2 v17
#define br3 v18
#define bi3 v19
#define apC v20
#define atr0 v21
#define ati0 v22
#define atr1 v23
#define ati1 v24
#define atr2 v25
#define ati2 v26
#define atr3 v27
#define ati3 v28

/**
  FPU registers
**/
#define far f0
#define fbr f1
#define fai f2
#define fbi f3
#define frsumr f4
#define frsumi f5
#define fisumi f6
#define fisumr f7
#define frsum f8
#define fisum f9
#define rsum_vmx f10
#define isum_vmx f11



/**
  Begin code text, Save some registers
  Here for conjugate inner product
**/
U_ENTRY( FUNC_CONJ_ENTRY )
MR( rtmp0, Cr )
MR( Cr, Ci )
MR( Ci, rtmp0 )
MR( rtmp0, Br0 )
MR( Br0, Bi0 )
MR( Bi0, rtmp0 )

/**
  Here for normal inner product
**/
FUNC_PROLOG
U_ENTRY( FUNC_ENTRY )
DECLARE_f0_f11
DECLARE_r3_r27
DECLARE_v0_v28

/**
  Initial setup code
**/
SAVE_r13_r27
USE_THRU_v28( VREGSAVE_COND )
LFS( frsumr, Ar0, 0 )
FSUBS( frsumr, frsumr, frsumr )
FMR( frsumi, frsumr )
FMR( fisumr, frsumr )
FMR( fisumi, frsumr )
FMR( rsum_vmx, frsumr )
FMR( isum_vmx, frsumr )

/**
  Process unaligned vector section first
**/
LABEL( SUFFIX( cont ) )
GET_VMX_UNALIGNED_COUNT( count, Br0 )
LI( aoffset, 0 )
LI( boffset, 0 )
BEQ( SUFFIX( aligned ) )
SUB( N, N, count ) /* adjust N for after loop */

/**
  Here to do first 1 to 3 points using standard FP
  Store result for later post_loop processing
**/
LFSX( far, Ar0, aoffset )
LFSX( fai, Ai0, aoffset )
DECR_C( count )
LFSX( fbr, Br0, boffset )
LFSX( fbi, Bi0, boffset )
FMULS( frsumr, far, fbr )
FMULS( frsumi, fai, fbi )
FMULS( fisumi, far, fbi )
FMULS( fisumr, fai, fbr )
ADDI( Ar0, Ar0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Bi0, Bi0, 4 )
BEQ( SUFFIX( aligned ) )

/**
  Loop does 1 or 2 more sum updates
**/
LABEL( SUFFIX( pre_loop ) )
LFSX( far, Ar0, aoffset )
LFSX( fai, Ai0, aoffset )
DECR_C( count )
LFSX( fbr, Br0, boffset )
LFSX( fbi, Bi0, boffset )
FMADDS( frsumr, far, fbr, frsumr )
ADDI( Ar0, Ar0, 4 )
FMADDS( frsumi, fai, fbi, frsumi )
ADDI( Ai0, Ai0, 4 )
FMADDS( fisumi, far, fbi, fisumi )
ADDI( Br0, Br0, 4 )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( pre_loop ) )

/**
  Here for VMX aligned loop code
  Prepare for loop entry: assign loop pointers, counters
**/

LABEL( SUFFIX( aligned ) )
SRWI_C( count, N, 4 ) /* 16 per trip */
LVSL( apC, Ar0, aoffset )
LI( aoffset, 0 )
LI( boffset, 0 )

ADDI( Ar1, Ar0, 16 )
VXOR( rsumr, rsumr, rsumr )
ADDI( Ar2, Ar0, 32 )
ADDI( Ar3, Ar0, 48 )

ADDI( Ai1, Ai0, 16 )
VXOR( isumi, isumi, isumi )
ADDI( Ai2, Ai0, 32 )
ADDI( Ai3, Ai0, 48 )

ADDI( Br1, Br0, 16 )
VXOR( rsumi, rsumi, rsumi )
ADDI( Br2, Br0, 32 )
ADDI( Br3, Br0, 48 )

ADDI( Bi1, Bi0, 16 )
ADDI( Bi2, Bi0, 32 )
VXOR( isumr, isumr, isumr )
ADDI( Bi3, Bi0, 48 )
BEQ( SUFFIX( two_left ) )

/**
  Loop windin section
**/
LOAD_A( atr0, Ar0, aoffset )
LOAD_A( ati0, Ai0, aoffset )
LOAD_A( atr1, Ar1, aoffset )
LOAD_A( ati1, Ai1, aoffset )

LOAD_A( atr2, Ar2, aoffset )
LOAD_A( ati2, Ai2, aoffset )
VPERM( ar0, atr0, atr1, apC )
LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
DECR_C( count )
VPERM( ai0, ati0, ati1, apC )
LOAD_B( br1, Br1, boffset )
VPERM( ar1, atr1, atr2, apC )
LOAD_A( atr3, Ar3, aoffset )
BR( SUFFIX( mid_loop ) )

/**
  Top of vector loop
**/
LABEL( SUFFIX( loop ) )
/* { */
LOAD_A( atr2, Ar2, aoffset )
VMADDFP( rsumr, ar3, br3, rsumr )
LOAD_A( ati2, Ai2, aoffset )
VPERM( ar0, atr0, atr1, apC ) /* uses last pass value */
VMADDFP( rsumi, ai3, bi3, rsumi )
LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
DECR_C( count )
VPERM( ai0, ati0, ati1, apC )
LOAD_B( br1, Br1, boffset )
VPERM( ar1, atr1, atr2, apC )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( atr3, Ar3, aoffset )
VMADDFP( isumr, ai3, br3, isumr )

/**
  Loop entry
**/

LABEL( SUFFIX( mid_loop ) )
VMADDFP( rsumr, ar0, br0, rsumr )
VPERM( ai1, ati1, ati2, apC )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ati3, Ai3, aoffset )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( bi1, Bi1, boffset )
ADDI( aoffset, aoffset, 64 )
VPERM( ar2, atr2, atr3, apC )
VMADDFP( isumi, ar0, bi0, isumi )
LOAD_B( br2, Br2, boffset )
VMADDFP( rsumr, ar1, br1, rsumr )
LOAD_B( bi2, Bi2, boffset )
VMADDFP( isumr, ai1, br1, isumr )

/**
  Loop exit
**/
VPERM( ai2, ati2, ati3, apC )
BEQ( SUFFIX( loop_exit ) )
LOAD_A( atr0, Ar0, aoffset )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ati0, Ai0, aoffset )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( br3, Br3, boffset )
VPERM( ar3, atr3, atr0, apC )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( atr1, Ar1, aoffset )
VMADDFP( rsumi, ai2, bi2, rsumi )
VPERM( ai3, ati3, ati0, apC )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, boffset )
ADDI( boffset, boffset, 64 )
LOAD_A( ati1, Ai1, aoffset )
VMADDFP( isumr, ai2, br2, isumr )
/* } */
BR( SUFFIX( loop ) )

/**
  windout section
**/
LABEL( SUFFIX( loop_exit ) )
LOAD_A( atr0, Ar0, aoffset )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ati0, Ai0, aoffset )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( br3, Br3, boffset )
VPERM( ar3, atr3, atr0, apC )
VMADDFP( rsumr, ar2, br2, rsumr )
VMADDFP( rsumi, ai2, bi2, rsumi )
VPERM( ai3, ati3, ati0, apC )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, boffset )
ADDI( boffset, boffset, 64 )
VMADDFP( isumr, ai2, br2, isumr )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
VMADDFP( isumi, ar3, bi3, isumi )
VMADDFP( isumr, ai3, br3, isumr )

/**
  Remaining sum updates
**/
LABEL( SUFFIX( two_left ) )
ANDI_C( count, N, 0x8 ) /* bit 3 */
BEQ( SUFFIX( one_left ) )

LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
LOAD_B( br1, Br1, boffset )
LOAD_B( bi1, Bi1, boffset )
ADDI( boffset, boffset, 32 )

LOAD_A( atr0, Ar0, aoffset )
LOAD_A( ati0, Ai0, aoffset )
LOAD_A( atr1, Ar1, aoffset )
LOAD_A( ati1, Ai1, aoffset )
LOAD_A( atr2, Ar2, aoffset )
LOAD_A( ati2, Ai2, aoffset )
ADDI( aoffset, aoffset, 32 )

VPERM( ar0, atr0, atr1, apC ) /* uses last pass value */
VPERM( ai0, ati0, ati1, apC )
VPERM( ar1, atr1, atr2, apC )
VPERM( ai1, ati1, ati2, apC )

VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumr, ai0, br0, isumr )
VMADDFP( isumi, ar0, bi0, isumi )

VMADDFP( rsumr, ar1, br1, rsumr )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumi, ai1, bi1, rsumi )
VMADDFP( isumi, ar1, bi1, isumi )
VMR( atr3, atr1 )
VMR( ati3, ati1 )

LABEL( SUFFIX( one_left ) )
ANDI_C( count, N, 0x4 ) /* bit 2 */
BEQ( SUFFIX( combine ) )

LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
ADDI( boffset, boffset, 16 )

LOAD_A( atr0, Ar0, aoffset )
LOAD_A( ati0, Ai0, aoffset )
LOAD_A( atr1, Ar1, aoffset )
LOAD_A( ati1, Ai1, aoffset )
ADDI( aoffset, aoffset, 16 )

VPERM( ar0, atr0, atr1, apC ) /* uses last pass value */
VPERM( ai0, ati0, ati1, apC )

VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumr, ai0, br0, isumr )
VMADDFP( isumi, ar0, bi0, isumi )

/**
  combine partial sums, permute, write out results
**/
LABEL( SUFFIX( combine ) )
VSUBFP( rsumr, rsumr, rsumi ) /* rsumr = rsumr - rsumi */
VADDFP( isumi, isumi, isumr )
/**
  8 bytes/cycle shuffle:
  real/imag logic should be intermixed for efficiency
**/
VMRGHW( rsum0, rsumr, rsumr )
ANDI_C( addr_incr, N, 0x3 )
VMRGHW( isum0, isumi, isumi )
VMRGLW( rsum1, rsumr, rsumr )
SUB( addr_incr, N, addr_incr ) /* offset index for remainders */
VMRGLW( isum1, isumi, isumi )
VADDFP( rsum0, rsum1, rsum0 )
SLWI( addr_incr, addr_incr, 2 ) /* byte offset */
VADDFP( isum0, isum1, isum0 )
VMRGHW( rsum1, rsum0, rsum0 )
ADD( Ar0, Ar0, addr_incr )
VMRGHW( isum1, isum0, isum0 )
ADD( Ai0, Ai0, addr_incr )
VMRGLW( rsum0, rsum0, rsum0 )
ADD( Br0, Br0, addr_incr )
VMRGLW( isum0, isum0, isum0 )
ADD( Bi0, Bi0, addr_incr )
VADDFP( rsumr, rsum1, rsum0 )
LI( coffset, 0 ) /* needed for output */
VADDFP( isumi, isum1, isum0 )
/**
  4 byte stores
**/
STVEWX( rsumr, Cr, coffset )
STVEWX( isumi, Ci, coffset )
/**
  Remainders of 1-3 more to do
**/
ANDI_C( N, N, 3 )
LFS( rsum_vmx, Cr, 0 )
LFS( isum_vmx, Ci, 0 )
BEQ( SUFFIX( scaler_vmx_combine ) )

/**
  Here to do last 1-3 points using standard FP
**/
LABEL( SUFFIX( post_loop ) )
LFS( far, Ar0, 0 )
LFS( fai, Ai0, 0 )
DECR_C( N )
LFS( fbr, Br0, 0 )
LFS( fbi, Bi0, 0 )
FMADDS( frsumr, far, fbr, frsumr )
FMADDS( frsumi, fai, fbi, frsumi )
FMADDS( fisumi, far, fbi, fisumi )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Ar0, Ar0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( post_loop ) )

/**
  Write out result
**/
LABEL( SUFFIX( scaler_vmx_combine ) )
FSUBS( frsum, frsumr, frsumi ) /* rsumr = rsumr - rsumi */
FADDS( fisum, fisumi, fisumr )
FADDS( frsum, frsum, rsum_vmx )
FADDS( fisum, fisum, isum_vmx )
STFS( frsum, Cr, 0 )
STFS( fisum, Ci, 0 )

/**
  return
**/
LABEL( SUFFIX( ret ) )
FREE_THRU_v28( VREGSAVE_COND )
REST_r13_r27
RETURN
FUNC_EPILOG
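
The ZDOTPR4_VMX.K listing above reads the A vectors with aligned loads only (LOAD_A at Ar0 and at Ar0 + 16), builds a permute control with LVSL, and merges adjacent blocks with VPERM to obtain the unaligned data. The following is a minimal scalar C sketch of that idiom, not the patent's code: `load_window`, `VEC_BYTES`, and the byte-buffer interface are hypothetical names introduced only for illustration, and `buf` is assumed to start on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { VEC_BYTES = 16 };

/* Hypothetical helper: fetches the 16 bytes starting at buf + off while
 * touching memory only in whole 16-byte blocks, as LVX does.
 * Assumption: the block following the one that contains `off` is readable. */
void load_window(const unsigned char *buf, size_t off,
                 unsigned char out[VEC_BYTES])
{
    size_t block = off & ~(size_t)(VEC_BYTES - 1); /* LVX ignores the low 4 address bits   */
    size_t shift = off & (size_t)(VEC_BYTES - 1);  /* LVSL derives its control from them   */
    unsigned char pair[2 * VEC_BYTES];
    size_t i;

    assert(buf != NULL && out != NULL);

    memcpy(pair, buf + block, VEC_BYTES);                         /* e.g. atr0 */
    memcpy(pair + VEC_BYTES, buf + block + VEC_BYTES, VEC_BYTES); /* e.g. atr1 */

    for (i = 0; i < VEC_BYTES; ++i)
        out[i] = pair[shift + i]; /* VPERM picks bytes shift .. shift+15 of the pair */
}
```

Because each block is loaded once and reused as the low half of the next pair, the vector loop in the listing needs only one new aligned load per 16 bytes of unaligned input.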
/* +
   MC Standard Algorithms -- PPC Macro language Version

   File Name:     ZDOTPR4_VMX.MAC
   Description:   Vector Single Precision Complex Dot Product
                  CPP dummy file for unaligned vector processing

   Entry/params:  ZDOTPR4_VMX (A, I, B, J, C, N)
   Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ]
                            - A->imagp[mI]*B->imagp[mJ])
                  C[1] = sum (A->realp[mI]*B->imagp[mJ]
                            + A->imagp[mI]*B->realp[mJ])
                         for m=0 to N-1

   Mercury Computer Systems, Inc.
   Copyright (c) 1998 All rights reserved

   Revision  Date    Engineer  Reason
   0.0       000607  fpl       Created (from zdotpr_vmx.mac)
+ */

#if defined( BUILD_MAX )

#undef VMX_SAL
#undef VMX_NN
#undef VMX_NC
#undef VMX_CN
#undef VMX_CC

#if !defined( COMPILE_ESAL_JUMP_TABLE )

/* 1 variant: _zdotpr4_vmx() */
#define VMX_SAL
#include "zdotpr4_vmx.k"

#else /* 5 variants based on ESAL flag */

#define VMX_NN
#include "zdotpr4_vmx.k"
#undef VMX_NN

#define VMX_NC
#include "zdotpr4_vmx.k"
#undef VMX_NC

#define VMX_CN
#include "zdotpr4_vmx.k"
#undef VMX_CN

#define VMX_CC
#include "zdotpr4_vmx.k"
#undef VMX_CC

#endif /* end COMPILE_ESAL_JUMP_TABLE */

#endif /* end BUILD_MAX */
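
The .MAC file above includes the same .K body several times, each time with a different VMX_xx symbol defined, so that SUFFIX( label ) pastes a distinct suffix onto every label and LOAD_A / LOAD_B select the load flavour. A single-file C sketch of the same preprocessor technique is shown below. It is an illustration under stated assumptions rather than the listing's build: `DEFINE_DOT`, `load_keep`, and `load_transient` are hypothetical names, the two loaders stand in for LVX and LVXL and behave identically here, and reading "C" and "N" as cached and non-cached operands is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two load flavours (LVX and LVXL in the listing). */
static float load_keep(const float *p, size_t i)      { return p[i]; }
static float load_transient(const float *p, size_t i) { return p[i]; }

/* One body, stamped out once per variant; `##` builds the entry name the
 * same way SUFFIX( label ) builds label##_nn, label##_nc, and so on. */
#define DEFINE_DOT(suffix, LOAD_A, LOAD_B)                          \
    float dot##suffix(const float *a, const float *b, size_t n)     \
    {                                                               \
        float sum = 0.0f;                                           \
        size_t i;                                                   \
        assert(a != NULL && b != NULL);                             \
        for (i = 0; i < n; ++i)                                     \
            sum += LOAD_A(a, i) * LOAD_B(b, i);                     \
        return sum;                                                 \
    }

DEFINE_DOT(_nn, load_transient, load_transient)
DEFINE_DOT(_nc, load_transient, load_keep)
DEFINE_DOT(_cn, load_keep,      load_transient)
DEFINE_DOT(_cc, load_keep,      load_keep)
```

Each expansion yields an independent function (`dot_nn`, `dot_nc`, `dot_cn`, `dot_cc`), which is what allows a jump table to pick a variant at run time from a flag.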
/* +
   MC Standard Algorithms -- PPC Macro language Version

   File Name:     ZDOTPR.K
   Description:   CPP Source code for Vector Single Precision
                  Split Complex Dot Product
   Entry/params:  ZDOTPR (A, I, B, J, C, N)
                  ZIDOTPR (A, I, B, J, C, N)

   Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ]
                          -/+ A->imagp[mI]*B->imagp[mJ])
                  C[1] = sum (A->realp[mI]*B->imagp[mJ]
                          +/- A->imagp[mI]*B->realp[mJ])
                         for m=0 to N-1

   Mercury Computer Systems, Inc.
   Copyright (c) 1998 All rights reserved

   Revision  Date    Engineer  Reason
             981215  fpl       Created
             990310  fpl       Integrated with 750 library
             000131  jfk       salppc.inc changes
             000223  fpl       Fixed pre-loop bug
             000717  fpl       Added dsts, removed LVXLs
+ */

#include "salppc.inc"

/**
  ESAL CPP definitions
**/

#undef FUNC_CONJ_ENTRY
#undef FUNC_ENTRY
#undef LOAD_A
#undef LOAD_B
#undef SUFFIX

#if defined( VMX_SAL )

#define FUNC_ENTRY _zdotpr_vmx
#define FUNC_CONJ_ENTRY _zidotpr_vmx
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#elif defined( VMX_NN )

#define FUNC_ENTRY _zdotpr_vmx_nn
#define FUNC_CONJ_ENTRY _zidotpr_vmx_nn
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_nn
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#elif defined( VMX_NC )

#define FUNC_ENTRY _zdotpr_vmx_nc
#define FUNC_CONJ_ENTRY _zidotpr_vmx_nc
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_nc
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control ) DST( ptr, control, 0 ) \
                             ADDI( ptr, ptr, 64 )
#define DSTB( ptr, control )
#define DST_ENABLE

#elif defined( VMX_CN )

#define FUNC_ENTRY _zdotpr_vmx_cn
#define FUNC_CONJ_ENTRY _zidotpr_vmx_cn
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_cn
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control ) DST( ptr, control, 0 ) \
                             ADDI( ptr, ptr, 64 )
#define DST_ENABLE

#elif defined( VMX_CC )

//#define FUNC_ENTRY _zdotpr_vmx_cc
#define FUNC_ENTRY _zdotpr_vmx
#define FUNC_CONJ_ENTRY _zidotpr_vmx_cc
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_cc
#undef DSTA( ptr, control )
#undef DSTB( ptr, control )
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#else
#error YOU MUST DEFINE VMX_xxx, where x = C or N
#endif

#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */
/**
  Local CPP definitions
**/
#define NMASK2 0x8
#define NMASK1 0x4
#define NSHIFT 4
#define ADDRESS_INCREMENT 16

/**
  Input args
**/
#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
  Split complex parameters
**/
#define Ar0 A
#define Ai0 r10
#define Br0 B
#define Bi0 r11
#define Cr C
#define Ci r12


/**
  Local registers
**/
#define count r4
#define rtmp0 r4
#define rtmp1 r13
#define dst_stride r13
#define num_blocks r14
#define Ar1 r13
#define Ai1 r14
#define Ar2 r15
#define Ai2 r16
#define Ar3 r17
#define Ai3 r18
#define Br1 r19
#define Bi1 r20
#define Br2 r21
#define Bi2 r22
#define Br3 r23
#define Bi3 r24
#define ptr_offset0 r25
#define ptr_offset1 r26
#define addr_incr r27
#define dst_rptr r28
#define dst_iptr r29
#define dst_control r30


/**
  VMX registers
**/
#define rsumr v0
#define rsumi v1
#define isumr v2
#define isumi v3
#define rsum0 v4
#define rsum1 v5
#define isum0 v6
#define isum1 v7
#define ar0 v4
#define ai0 v5
#define ar1 v6
#define ai1 v7
#define ar2 v8
#define ai2 v9
#define ar3 v10
#define ai3 v11
#define br0 v12
#define bi0 v13
#define br1 v14
#define bi1 v15
#define br2 v16
#define bi2 v17
#define br3 v18
#define bi3 v19



/**
  FPU registers
**/
#define far f0
#define fbr f1
#define fai f2
#define fbi f3
#define frsumr f4
#define frsumi f5
#define fisumi f6
#define fisumr f7
#define frsum f8
#define fisum f9
#define rsum_vmx f10
#define isum_vmx f11



/**
  Begin code text, Save some registers
  Here for conjugate inner product
**/
U_ENTRY( FUNC_CONJ_ENTRY )
MR( rtmp0, Cr )
MR( Cr, Ci )
MR( Ci, rtmp0 )
MR( rtmp0, Br0 )
MR( Br0, Bi0 )
MR( Bi0, rtmp0 )

/**
  Here for normal inner product
**/
U_ENTRY( FUNC_ENTRY )
DECLARE_f0_f11
DECLARE_r3_r30
DECLARE_v0_v19

/**
  Initial setup code
**/
SAVE_r13_r30
USE_THRU_v19( VREGSAVE_COND )
LFS( frsumr, Ar0, 0 )
FSUBS( frsumr, frsumr, frsumr )
FMR( frsumi, frsumr )
FMR( fisumr, frsumr )
FMR( fisumi, frsumr )
FMR( rsum_vmx, frsumr )
FMR( isum_vmx, frsumr )

/**
  Process unaligned vector section first
**/
LABEL( SUFFIX( cont ) )
GET_VMX_UNALIGNED_COUNT( count, Ar0 )
LI( ptr_offset0, 0 )
BEQ( SUFFIX( aligned ) )
SUB( N, N, count ) /* adjust N for after loop */

/**
  Here to do first 1 to 3 points using standard FP
  Store result for later post_loop processing
**/
LFSX( far, Ar0, ptr_offset0 )
LFSX( fai, Ai0, ptr_offset0 )
DECR_C( count )
LFSX( fbr, Br0, ptr_offset0 )
LFSX( fbi, Bi0, ptr_offset0 )
FMULS( frsumr, far, fbr )
FMULS( frsumi, fai, fbi )
FMULS( fisumi, far, fbi )
FMULS( fisumr, fai, fbr )
ADDI( Ar0, Ar0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Bi0, Bi0, 4 )
BEQ( SUFFIX( aligned ) )

/**
  Loop does 1 or 2 more sum updates
**/
LABEL( SUFFIX( pre_loop ) )
LFSX( far, Ar0, ptr_offset0 )
LFSX( fai, Ai0, ptr_offset0 )
DECR_C( count )
LFSX( fbr, Br0, ptr_offset0 )
LFSX( fbi, Bi0, ptr_offset0 )
FMADDS( frsumr, far, fbr, frsumr )
ADDI( Ar0, Ar0, 4 )
FMADDS( frsumi, fai, fbi, frsumi )
ADDI( Ai0, Ai0, 4 )
FMADDS( fisumi, far, fbi, fisumi )
ADDI( Br0, Br0, 4 )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( pre_loop ) )

/**
  Here for VMX aligned loop code
  Prepare for loop entry: assign loop pointers, counters
**/
LABEL( SUFFIX( aligned ) )
/**
  DST setup: bring in 2 cachelines
  MAKE_STREAM_CODE( control_register, bytes_per_block, block_count,
                    byte_stride )
**/
#if defined( DST_ENABLE )
#if defined( EXPAND_NCC )
MR( dst_rptr, Ar )
MR( dst_iptr, Ai )
#elif defined( EXPAND_CNC )
MR( dst_rptr, Br )
MR( dst_iptr, Bi )
#endif

MAKE_STREAM_CODE( dst_control, 64, 1, 0 )
DSTA( dst_rptr, dst_control )
DSTA( dst_iptr, dst_control )
DSTB( dst_rptr, dst_control )
DSTB( dst_iptr, dst_control )
#endif

SRWI_C( count, N, NSHIFT ) /* 16 per trip */
LI( addr_incr, ADDRESS_INCREMENT ) /* constants defined above */
SLWI( ptr_offset1, addr_incr, 2 )
NEG( ptr_offset1, ptr_offset1 ) /* will be adding addr_incr << 3 */

ADD( Ar1, Ar0, addr_incr )
VXOR( rsumr, rsumr, rsumr )
ADD( Br1, Br0, addr_incr )
ADD( Ai1, Ai0, addr_incr )
VXOR( rsumi, rsumi, rsumi )
ADD( Bi1, Bi0, addr_incr )

ADD( Ar2, Ar1, addr_incr )
VXOR( isumr, isumr, isumr )
ADD( Br2, Br1, addr_incr )
ADD( Ai2, Ai1, addr_incr )
VXOR( isumi, isumi, isumi )
ADD( Bi2, Bi1, addr_incr )

ADD( Ar3, Ar2, addr_incr )
ADD( Br3, Br2, addr_incr )
ADD( Ai3, Ai2, addr_incr )
ADD( Bi3, Bi2, addr_incr )

SLWI( addr_incr, addr_incr, 3 ) /* bump by 8 elements */

/**
  Loop entry code
**/
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset0 )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )

/**
  Top of double loop structure
**/
LABEL( SUFFIX( loop0 ) )
LOAD_A( ar1, Ar1, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
DSTA( dst_iptr, dst_control )
LOAD_B( br1, Br1, ptr_offset0 )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ai1, Ai1, ptr_offset0 )
LOAD_B( bi1, Bi1, ptr_offset0 )
DSTB( dst_iptr, dst_control )
DECR_C( count )

LOAD_A( ar2, Ar2, ptr_offset0 )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( br2, Br2, ptr_offset0 )
VMADDFP( rsumr, ar1, br1, rsumr )
ADD( ptr_offset1, ptr_offset1, addr_incr )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ai2, Ai2, ptr_offset0 )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( bi2, Bi2, ptr_offset0 )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( ar3, Ar3, ptr_offset0 )
VMADDFP( rsumi, ai2, bi2, rsumi )
LOAD_B( br3, Br3, ptr_offset0 )
LOAD_A( ai3, Ai3, ptr_offset0 )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, ptr_offset0 )
VMADDFP( isumr, ai2, br2, isumr )
BEQ( SUFFIX( loop0_exit ) )
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset1 )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset1 )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( ai0, Ai0, ptr_offset1 )
LOAD_B( bi0, Bi0, ptr_offset1 )
VMADDFP( isumr, ai3, br3, isumr )
BR( SUFFIX( loop1 ) )

/**
  loop exit
**/
LABEL( SUFFIX( loop0_exit ) )
MR( ptr_offset0, ptr_offset1 )
BR( SUFFIX( loop1_exit ) )

/**
  Top of second loop
**/
LABEL( SUFFIX( loop1 ) )
LOAD_A( ar1, Ar1, ptr_offset1 )
VMADDFP( rsumr, ar0, br0, rsumr )
DSTA( dst_iptr, dst_control )
LOAD_B( br1, Br1, ptr_offset1 )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ai1, Ai1, ptr_offset1 )
LOAD_B( bi1, Bi1, ptr_offset1 )
DSTB( dst_iptr, dst_control )
DECR_C( count )

LOAD_A( ar2, Ar2, ptr_offset1 )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( br2, Br2, ptr_offset1 )
VMADDFP( rsumr, ar1, br1, rsumr )
ADD( ptr_offset0, ptr_offset0, addr_incr )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ai2, Ai2, ptr_offset1 )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( bi2, Bi2, ptr_offset1 )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( ar3, Ar3, ptr_offset1 )
VMADDFP( rsumi, ai2, bi2, rsumi )
LOAD_B( br3, Br3, ptr_offset1 )
LOAD_A( ai3, Ai3, ptr_offset1 )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, ptr_offset1 )
VMADDFP( isumr, ai2, br2, isumr )
BEQ( SUFFIX( loop1_exit ) )
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset0 )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset0 )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
VMADDFP( isumr, ai3, br3, isumr )
BR( SUFFIX( loop0 ) )

/**
  Drop out of loop, flush pipe
**/
LABEL( SUFFIX( loop1_exit ) )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
VMADDFP( isumi, ar3, bi3, isumi )
VMADDFP( isumr, ai3, br3, isumr )

/**
  Remaining sum updates
**/
LABEL( SUFFIX( two_left ) )
ANDI_C( count, N, 0x8 ) /* bit 3 */
BEQ( SUFFIX( one_left ) )

LOAD_A( ar0, Ar0, ptr_offset0 )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
LOAD_A( ar1, Ar1, ptr_offset0 )
LOAD_B( br1, Br1, ptr_offset0 )
LOAD_A( ai1, Ai1, ptr_offset0 )
LOAD_B( bi1, Bi1, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
VMADDFP( rsumr, ar1, br1, rsumr )
VMADDFP( rsumi, ai1, bi1, rsumi )
VMADDFP( isumi, ar1, bi1, isumi )
VMADDFP( isumr, ai1, br1, isumr )
ADDI( ptr_offset0, ptr_offset0, 32 )

LABEL( SUFFIX( one_left ) )
ANDI_C( count, N, 0x4 ) /* bit 2 */
BEQ( SUFFIX( combine ) )
LOAD_A( ar0, Ar0, ptr_offset0 )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
ADDI( ptr_offset0, ptr_offset0, 16 )

/**
  combine partial sums, permute, write out results
**/
LABEL( SUFFIX( combine ) )
VSUBFP( rsumr, rsumr, rsumi ) /* rsumr = rsumr - rsumi */
VADDFP( isumi, isumi, isumr )
/**
  8 bytes/cycle shuffle:
  real/imag logic should be intermixed for efficiency
**/
VMRGHW( rsum0, rsumr, rsumr )
ANDI_C( addr_incr, N, 0x3 )
VMRGHW( isum0, isumi, isumi )
VMRGLW( rsum1, rsumr, rsumr )
SUB( addr_incr, N, addr_incr ) /* offset index for remainders */
VMRGLW( isum1, isumi, isumi )
VADDFP( rsum0, rsum1, rsum0 )
SLWI( addr_incr, addr_incr, 2 ) /* byte offset */
VADDFP( isum0, isum1, isum0 )
VMRGHW( rsum1, rsum0, rsum0 )
ADD( Ar0, Ar0, addr_incr )
VMRGHW( isum1, isum0, isum0 )
ADD( Ai0, Ai0, addr_incr )
VMRGLW( rsum0, rsum0, rsum0 )
ADD( Br0, Br0, addr_incr )
VMRGLW( isum0, isum0, isum0 )
ADD( Bi0, Bi0, addr_incr )
VADDFP( rsumr, rsum1, rsum0 )
LI( ptr_offset0, 0 ) /* needed for output */
VADDFP( isumi, isum1, isum0 )
/**
  4 byte stores
**/
STVEWX( rsumr, Cr, ptr_offset0 )
STVEWX( isumi, Ci, ptr_offset0 )
/**
  Remainders of 1-3 more to do
**/
ANDI_C( N, N, 3 )
LFS( rsum_vmx, Cr, 0 )
LFS( isum_vmx, Ci, 0 )
BEQ( SUFFIX( scaler_vmx_combine ) )

/**
  Here to do last 1-3 points using standard FP
**/
LABEL( SUFFIX( post_loop ) )
LFS( far, Ar0, 0 )
LFS( fai, Ai0, 0 )
DECR_C( N )
LFS( fbr, Br0, 0 )
LFS( fbi, Bi0, 0 )
FMADDS( frsumr, far, fbr, frsumr )
FMADDS( frsumi, fai, fbi, frsumi )
FMADDS( fisumi, far, fbi, fisumi )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Ar0, Ar0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( post_loop ) )

/**
  Write out result
**/
LABEL( SUFFIX( scaler_vmx_combine ) )
FSUBS( frsum, frsumr, frsumi ) /* rsumr = rsumr - rsumi */
FADDS( fisum, fisumi, fisumr )
FADDS( frsum, frsum, rsum_vmx )
FADDS( fisum, fisum, isum_vmx )
STFS( frsum, Cr, 0 )
STFS( fisum, Ci, 0 )

/**
  return
**/
LABEL( SUFFIX( ret ) )
FREE_THRU_v19( VREGSAVE_COND )
REST_r13_r30
RETURN
FUNC_EPILOG
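
The ZDOTPR.K listing above processes the input in passes: a scalar pre-loop until the A pointer is vector-aligned, a main loop that consumes 16 points per trip, optional blocks of 8 and 4 points selected by bits 3 and 2 of the remaining count, and a scalar post-loop for the last 1 to 3 points. Four partial sums are kept and combined only at the end. The C sketch below models that control flow with scalar arithmetic. It is an approximation for illustration, not a translation: `partial_sums`, `accumulate`, and `zdot_segmented` are hypothetical names, unit stride is assumed, and the vector lanes are collapsed into plain loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four running products: rr = sum(ar*br), ii = sum(ai*bi),
 * ri = sum(ar*bi), ir = sum(ai*br). */
typedef struct { float rr, ii, ri, ir; } partial_sums;

static void accumulate(partial_sums *s, const float *ar, const float *ai,
                       const float *br, const float *bi,
                       size_t first, size_t count)
{
    size_t m;
    for (m = first; m < first + count; ++m) {
        s->rr += ar[m] * br[m];
        s->ii += ai[m] * bi[m];
        s->ri += ar[m] * bi[m];
        s->ir += ai[m] * br[m];
    }
}

/* Hypothetical scalar model of the pass structure (unit stride assumed). */
void zdot_segmented(const float *ar, const float *ai,
                    const float *br, const float *bi,
                    size_t n, float c[2])
{
    partial_sums s = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t m = 0, left, trips;

    assert(ar && ai && br && bi && c);

    /* Pre-loop: scalar steps until ar + m sits on a 16-byte boundary. */
    while (m < n && ((uintptr_t)(ar + m) & 15u) != 0) {
        accumulate(&s, ar, ai, br, bi, m, 1);
        ++m;
    }

    left = n - m;
    for (trips = left >> 4; trips > 0; --trips) {   /* 16 per trip */
        accumulate(&s, ar, ai, br, bi, m, 16);
        m += 16;
    }
    if (left & 0x8) {                               /* bit 3: 8 more points */
        accumulate(&s, ar, ai, br, bi, m, 8);
        m += 8;
    }
    if (left & 0x4) {                               /* bit 2: 4 more points */
        accumulate(&s, ar, ai, br, bi, m, 4);
        m += 4;
    }
    accumulate(&s, ar, ai, br, bi, m, left & 0x3);  /* post-loop: last 0 to 3 */

    c[0] = s.rr - s.ii;  /* real part      */
    c[1] = s.ri + s.ir;  /* imaginary part */
}
```

Keeping the four products in separate accumulators defers the subtraction and addition to a single combine step, which mirrors the VSUBFP / VADDFP pair at the `combine` label.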
/* +
   MC Standard Algorithms -- PPC Macro language Version

   File Name:     ZDOTPR.MAC
   Description:   Vector Single Precision Complex Dot Product
   Entry/params:  ZDOTPR (A, I, B, J, C, N)
   Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ]
                            - A->imagp[mI]*B->imagp[mJ])
                  C[1] = sum (A->realp[mI]*B->imagp[mJ]
                            + A->imagp[mI]*B->realp[mJ])
                         for m=0 to N-1

   Mercury Computer Systems, Inc.
   Copyright (c) 1998 All rights reserved

   Revision  Date    Engineer  Reason
   0.0       981209  fpl       Created (from cdotpr.mac)
   0.1       990310  fpl       750/G4 integration
   0.1       990322  fpl       Stylistic changes
+ */

#define COMPILE_ESAL_JUMP_TABLE
#define FUNC_TYPE ZDOTPR

#if defined( BUILD_MAX )

#undef VMX_SAL
#undef VMX_NN
#undef VMX_NC
#undef VMX_CN
#undef VMX_CC

#if !defined( COMPILE_ESAL_JUMP_TABLE ) || defined( COMPILE_NO_ESAL_JUMP_TABLE )

/* 1 variant: _zdotpr_vmx() */
#define VMX_SAL
#include "zdotpr_vmx.k"

#else /* 5 variants based on ESAL flag */

#define VMX_NN
#include "zdotpr_vmx.k"
#undef VMX_NN

#define VMX_NC
#include "zdotpr_vmx.k"
#undef VMX_NC

#define VMX_CN
#include "zdotpr_vmx.k"
#undef VMX_CN

#define VMX_CC
#include "zdotpr_vmx.k"
#undef VMX_CC

#endif /* end COMPILE_ESAL_JUMP_TABLE */

#endif /* end BUILD_MAX */
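
The file headers in this listing give the dot product as C[0] = sum(Ar*Br -/+ Ai*Bi) and C[1] = sum(Ar*Bi +/- Ai*Br) over m = 0 to N-1, with strides I and J. A plain C reference for checking an optimized routine against that formula might look like the sketch below. The names `split_complex` and `zdotpr_ref` and the `conjugate` flag are hypothetical, and taking the lower signs to be the ZIDOTPR (conjugate) entry is an assumption based on how the headers pair the two entry points.

```c
#include <assert.h>
#include <stddef.h>

/* Split-complex vector: separate arrays for real and imaginary parts. */
typedef struct {
    const float *realp;
    const float *imagp;
} split_complex;

/* Hypothetical reference routine. conjugate == 0 uses the upper signs of
 * the header formula, conjugate != 0 the lower signs (assumed ZIDOTPR). */
void zdotpr_ref(const split_complex *a, size_t i,
                const split_complex *b, size_t j,
                float c[2], size_t n, int conjugate)
{
    float re = 0.0f, im = 0.0f;
    size_t m;

    assert(a != NULL && b != NULL && c != NULL);

    for (m = 0; m < n; ++m) {
        float ar = a->realp[m * i], ai = a->imagp[m * i];
        float br = b->realp[m * j], bi = b->imagp[m * j];
        if (conjugate) {
            re += ar * br + ai * bi;
            im += ar * bi - ai * br;
        } else {
            re += ar * br - ai * bi;
            im += ar * bi + ai * br;
        }
    }
    c[0] = re;
    c[1] = im;
}
```

For A = (1+2i, 3+4i) and B = (5+6i, 7+8i) with unit strides, the normal form gives -18 + 68i and the conjugate form gives 70 - 8i.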